When data are replicated in a distributed system, it is necessary to ensure
that  different  copies  of  the  same  data  are  kept  consistent  with  respect  to  one 
another.  This  could  lead  to  a  substantial  performance  degradation  (latency)  when 
operations  are  performed  on  replicated  data.  Even  if  a  copy  of  the  required  data  is 
available at the site where an operation is performed, it may be necessary to
communicate with other sites where copies of the same data reside, in order to ensure
that  consistency  is  not  violated.  Performance  suffers  because  message  transmission 
times  are  typically  much  higher  than  local  computation  times. 

In  this  thesis,  we  first  present  an  object-based  model  of  a  distributed  system. 
We  then  use  this  model  to  show  that  data  can  be  replicated  in  a  manner  that  does 
not  incur  the  latency  cost  described  above.  Further,  we  prove  that  it  is  possible  to 
transmit  the  information  required  to  maintain  consistency  by  piggybacking  it  upon 
synchronization  messages  that  would  be  sent  even  if  data  were  not  replicated.  The 
replication  method  hence  does  not  require  any  additional  messages.  We  describe 
actual  implementations  of  this  method  and  discuss  their  behavior  in  conjunction 
with  roll-back  and  roll-forward  failure  handling  mechanisms.  Finally,  we  compare 
the performance of one such implementation with a more synchronous implementation
and demonstrate that our method performs substantially better.

CHAPTER 1


Introduction 

1.1.  Distributed  Computer  Systems 

Early  computer  systems  consisted  of  a  powerful  centralized  computer  accessed 
by means of relatively unsophisticated terminals. As a result of advances in computer
hardware and the development of the personal computer or work-station, it is
now possible to place upon one’s desk-top the processing power that used to be
available only in large mainframe computers. This has led to the growth of distributed
systems. A distributed system consists of a number of independent processing
sites, interconnected by means of a communications network.¹ Each site possesses
memory units to store data and processing units to access and manipulate data
stored at that site. The network enables sites to send messages to one another and
can be used to access data stored at a remote site.

¹ The results in this work are equally valid if the term site is replaced by process
at a site, provided that processes communicate by sending messages to one
another and share no memory.

A distributed system has a number of advantages over a centralized one. In a
distributed system, each user can be provided with the processing power of a
separate computer at relatively low cost. Users can share common services like a
file system, specialized databases, or high quality output devices, which may be
located at remote sites and accessed using the network. Thus, the cost of such services
is amortized over all the sites. A distributed system also has performance
advantages. In a centralized system, all operations are carried out at the same
site.  Hence,  the  central  site  must  be  flexible  enough  to  respond  to  different  types  of 
requests  and  cannot  be  optimized  for  a  single  type  of  operation.  In  a  distributed 
system, on the other hand, sites can have different capabilities and system performance
can be improved by executing operations at sites best suited to do so.
Further,  if  the  work-load  is  uneven,  bottlenecks  can  form  in  a  centralized  system 
during  times  of  peak  load,  with  the  system  remaining  relatively  idle  at  other  times. 
In  a  distributed  system,  the  work-load  can  be  spread  out  more  evenly  by  assigning 
jobs  to  lightly  loaded  sites.  Another  advantage  of  distributed  systems  is  that  they 
are  easy  to  expand:  more  sites  can  be  added  as  required.  A  distributed  system  is 
also  potentially  more  robust  than  a  centralized  one  because  if  a  site  fails,  the  other 
sites  can  continue  to  function.  In  addition,  operational  sites  can  assume  some  of 
the  work  that  would  have  been  performed  at  a  failed  site.  For  this  to  be  possible, 
information  stored  at  the  failed  site  must  be  available  to  an  operational  one.  This 
motivates  the  need  for  replicated  data. 

1.2.  Replicated  data 

A  distributed  system  permits  copies  of  the  same  information  to  be  stored  at 
more  than  one  site.  If  a  data  item  is  accessed  frequently  from  a  set  of  sites,  a  copy 
can  be  stored  at  each  of  these  sites,  and  each  site  could  access  its  local  copy.  Since 
the  time  required  for  a  message  to  be  sent  between  sites  is  typically  much  higher 
than that required for local processing, accessing a local copy of data in a distributed
system is much faster than accessing data stored at a remote site. Replicated
data can also be used to achieve fault-tolerance. If a site fails, data stored at
that site may become unavailable to the rest of the system. However, if data are
replicated,  the  system  can  continue  operating  by  accessing  another  copy  of  the  data 
instead. 

Replicated data, however, are not without their costs. The obvious ones are the
costs of storing more than one copy of the same data and the processing costs
involved with updating several copies instead of one. These costs are unavoidable.
Other costs arise from the need to keep different copies of replicated data consistent.
When one copy is changed, the other copies must be updated to reflect this
change. Otherwise, actions taken by a site based on one copy of replicated data
may not be in agreement with actions taken by a different site based on another copy.
A  simple  example  will  illustrate  this  problem. 

Consider  a  replicated  data  object  that  represents  airline  seat  reservations.  If 
one  copy  of  the  object  is  updated  whenever  a  reservation  is  made,  but  the  change  is 
not  propagated  to  the  other  copies,  travel  agents  using  different  copies  may  make 
more  reservations  than  there  are  seats  available,  or  assign  the  same  seat  to 
different passengers. This outcome could be embarrassing to the airline and inconvenient
for the passengers. In this example, the copies of the replicated data object
are  inconsistent. 

The  cost  of  keeping  copies  of  replicated  data  consistent  manifests  itself  in  two 
ways. First, the time taken to update replicated data is greater than that for non-replicated
data. We call this effect latency. Compare a non-replicated implementation
of the airline seat reservation object with a replicated one. In the former case,
the time taken for an update by the site where the object is stored is simply the
time required for local processing. On the other hand, if a user at a remote site
wishes  to  perform  an  update,  he  or  she  must  wait  for  a  message  transmission  to 
take  place  between  the  sites.  In  the  replicated  case,  an  update  made  by  any  site 
must  be  propagated  to  other  sites.  Moreover,  an  update  u  cannot  be  performed 
until  all  the  sites  agree  that  it  is  safe  to  do  so;  that  is,  there  are  no  ongoing  updates 
at  remote  sites  that  must  be  completed  everywhere  before  u.  Thus,  depending  on 
the  implementation,  the  time  required  to  perform  an  update  in  the  replicated  case 
could be as long as the time required to perform a remote update in the non-replicated
case. As a result, the response time when updating replicated data will
be greater than that for updating non-replicated data, even when a copy is available
locally. This latency could be avoided by decoupling remote updates from local
ones,  but  this  must  be  done  in  a  way  that  maintains  consistency. 

The  problem  of  decoupling  remote  updates  from  local  ones  has  been  addressed 
in  the  context  of  replicated  databases,  where  only  read  and  write  operations  are 
possible  on  the  data.  Traiger  et  al.  show  that  when  two-phase  locking  is  used, 
remote  writes  can  be  deferred  until  commit  time  without  affecting  the  consistency 
of  data  [45].  Eager  and  Sevcik  describe  a  concurrency  control  method  in  which 
transactions  are  executed  locally,  with  write  operations  propagated  to  remote  sites 
later  [17].  In  [28],  we  show  how  to  decouple  remote  updates  in  a  database  system 
that uses any concurrency control algorithm following a read-one-copy, write-all-copies
rule. In this work, we generalize that result even further, beyond database
systems, to systems with arbitrary types of data and more complex operations
than reads and writes.


The  other  cost  that  arises  from  the  need  to  maintain  consistency  is  an  increase 
in  message  traffic.  This  follows  from  the  fact  that  information  about  updates  must 
be  transmitted  between  sites  where  copies  of  replicated  data  are  situated.  In  many 
distributed  systems,  the  critical  cost  factor  is  the  number  of  messages  sent,  and  not 
their  size.  This  is  especially  prevalent  when  each  message  is  processed  by  a  large 
number  of  software  layers  before  being  physically  transmitted  or  after  being 
received. In such systems, each message sent has a relatively high fixed cost and a
smaller variable cost that depends on its size. Therefore, in this thesis, we measure
the cost of communication by the number of messages transmitted, rather than
the  total  amount  of  information  sent.  This  cost  measure  can  be  justified  on  the 
basis  of  studies  such  as  [11]. 


1.3.  Overview  of  the  thesis 


The  aim  of  this  work  is  to  develop  a  method  for  managing  replicated  data  that 
eliminates  latency.  Thus,  our  work  can  be  viewed  as  a  generalization  of  the  results 
in [17, 28, 45] to systems with arbitrary types of data, with operations more powerful
than reads and writes, and with other kinds of synchronization mechanisms
than two-phase locking. Further, we show that it is possible to transmit the information
required to maintain consistency using messages that would be sent
between sites even if data were not replicated. In other words, the replication
method requires no additional messages. The implication is that of the costs of
replication described above, the only costs that need be incurred are the unavoidable
costs of additional storage and local processing.


The thesis is organized as follows. In the next chapter, we present a model for
distributed  systems.  The  model  was  motivated  by  the  use  of  logs  to  model  database 
systems,  extended  to  abstract  data  types  as  in  [46],  and  treats  replicated  data  in  a 
manner  similar  to  [25].  Chapter  3  focuses  on  one  component  of  the  model  —  the 
scheduler.  The  concept  of  schedulers  for  database  systems  is  generalized  to  cover 
systems  with  arbitrary  types  of  data  and  operations.  The  notions  of  conflicting 
operations and classes of serializable schedules are accordingly generalized. In
Chapter 4, the implementation of another component of the model, distributed
objects, is discussed. Chapter 5 uses a result from Chapter 4 to develop two
methods for managing replicated data that realize the goals described above. Since
one  of  the  reasons  for  replicating  data  is  to  tolerate  failures,  failure  handling 
mechanisms are discussed in Chapter 6. We implemented one of the replication
methods  described  in  Chapter  5  and  measured  its  performance.  The  results  of 
these tests are presented in Chapter 7. The last chapter summarizes the work in
this thesis and indicates future directions.


CHAPTER  2 


Model 


2.1.  Introduction 

The major portion of this chapter is devoted to developing a model for distributed
systems. We are primarily interested in distributed systems consisting of a
cluster of computers or work-stations, connected to one another by a high
bandwidth local-area network like an Ethernet [36]. We assume that the sites are
functionally equivalent; that is, they are capable of performing the same types of
operations, though some sites may be more efficient at some operations than others.
We further assume that the system is asynchronous: the relative speeds of computations
at different sites and the times taken for message transmissions are
unpredictable. Asynchronicity also implies that there is no global clock or shared
memory  with  which  different  sites  can  coordinate  their  actions. 

One  aspect  of  a  distributed  system  that  we  model  is  the  synchronization 
between  events  at  different  sites.  Typically,  several  sites  in  a  distributed  system 
cooperate  to  solve  the  same  problem.  This  requires  that  events  at  different  sites  be 
coordinated  and  obey  certain  ordering  constraints.  The  problem  of  synchronizing 
concurrent  events  is  not  new;  it  has  been  studied  extensively  in  the  context  of 
operating systems [2, 16, 26]. However, differences arise when the system is distributed
and asynchronous. Synchronization based on a common clock is not possible.
Using shared memory locations for synchronization is impractical because of the
long delays associated with sending messages to access memory at a remote site.
Furthermore, methods based on message passing must be modified to take into
account the fact that messages between sites take much longer than messages
between processes at the same site. An important observation is that the absence
of a global clock means that any synchronization method in an asynchronous distributed
system must be based on messages sent between sites. We return to this in
Section  2.6. 

2.2.  Overview  of  the  system  model 

Our  model  of  a  distributed  system  is  motivated  by  the  use  of  logs  to  model 
database  systems  as  in  [5,  12,  18,  37,  38,  43].  In  these  models,  all  data  items  are  of 
the same type: they have a single value that can be accessed using a read operation
or overwritten using a write operation. We extend the idea of logs to abstract
data types with arbitrary operations as in [46]. The extension to replicated data is
done in a manner similar to [25]. We make use of terms like transaction and serializability
from database theory, but these terms are used here in a more general
context  than  is  standard. 

Our  model  consists  of  three  components:  distributed  objects,  transactions,  and 
the scheduler. Distributed objects model the data-storage and manipulation aspects
of a distributed system. A distributed object encapsulates some data and provides a
set of operations to access and alter these data. The act of causing an object to perform
an operation on its data is called an invocation of the object.¹ When an object
is invoked, it provides a result for the invocation. The result depends on the
operation, the value of the arguments for the invocation, and the current values of
the object’s data. An invocation may also cause the values of the object’s data to be
changed.

¹ We use the terms object and distributed object interchangeably.

Each  object  is  an  instance  of  an  object  type.  Objects  of  the  same  type  each  have 
their  own  copies  of  data,  but  provide  the  same  set  of  operations.  Finally,  associated 
with each object O is a set of sites $\mathit{Accessible}_O$ from which O is accessible. O can be
invoked only from sites in $\mathit{Accessible}_O$.

As an example, consider an object of type integer. Objects of this type encapsulate
a single data item: an integer value. The operations they provide are read,
which can be used to obtain the current value of the integer, and write. Invoking
write(i) changes the value of the integer to i and returns ok() when the operation
is completed. Thus, if the integer object TemperatureInIthaca is invoked to perform
write(-20), the value of TemperatureInIthaca would be changed to -20 and the
result would be ok(). We denote this as [TemperatureInIthaca.write(-20); ok()]. If
this is followed by an invocation of the read operation on TemperatureInIthaca, the
result would be ok(-20).

More  complex  objects  are  possible.  An  example  is  an  object  of  type  queue, 
representing  an  ordered  list  of  records.  Objects  of  this  type  provide  the  operations 
insert, first, and ListQueue. Invoking insert(r) inserts record r at the end of the
queue, obtaining the result ok() when done. The invocation first() returns ok(r) if
the queue is not empty and r is the first record in the queue. If the queue is
empty, the result returned is empty(). The result of ListQueue() is a list of all the
records  in  the  queue,  in  order. 
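
The queue type just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `QueueObject`, the method name `list_queue`, and the encoding of results as `(tag, payload)` tuples are hypothetical, and `first` is assumed to leave the queue unchanged, since the text does not say that it removes the record.

```python
from collections import deque


class QueueObject:
    """Illustrative queue object; results are modeled as (tag, payload) tuples."""

    def __init__(self):
        self._records = deque()

    def insert(self, r):
        # Append record r at the end of the queue and acknowledge with ok().
        self._records.append(r)
        return ("ok", ())

    def first(self):
        # Return ok(r) for the first record r, or empty() if there is none.
        # Assumption: the record is not removed from the queue.
        if not self._records:
            return ("empty", ())
        return ("ok", (self._records[0],))

    def list_queue(self):
        # Return ok(...) carrying all records, in queue order.
        return ("ok", tuple(self._records))
```

Each method call plays the role of one invocation, and its return value plays the role of the result.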

Another object type is indexed_file, with operations add, remove, lookup,
and ListFile. An object of this type represents a file of indexed records. Invoking
add(i, r) adds record r to the file with index i, and remove(i) removes the record
with index i. The invocation lookup(i) returns the record associated with index i,
if it exists in the file. If no such record exists, the result returned is not_found().
ListFile() returns a list of the records currently in the file, in the order they were
added  to  the  file. 

Objects  in  a  distributed  system  are  likely  to  be  accessed  by  a  number  of  users 
concurrently.  Yet,  a  user  may  wish  to  perform  a  series  of  operations  on  one  or 
more objects without their executions being interrupted by another user. An example
of where this might be necessary is given below.

Consider two objects Account1 and Account2 of type integer, each containing
the current balance in a bank account. Assume that they initially contain $1000
and $2000 respectively, and that user A wishes to transfer $100 from Account1 to
Account2, while user B wishes to transfer $50 from Account2 to Account1. Let
their invocations be interleaved as shown in Figure 2.1. We see that although each
user individually performed a correct sequence of invocations, the way in which
their invocations were interleaved resulted in $2100 being placed in Account2 and
$1050 in Account1. This is clearly an incorrect outcome (especially from the point
of  view  of  the  bank).  The  system  has  been  made  inconsistent. 
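
The effect of this interleaving can be replayed with a small Python sketch. It assumes each user first reads both balances and later writes values computed from what was read; the function name `run` and the tuple encoding of steps are hypothetical, introduced only for illustration.

```python
def run(schedule, balances):
    """Execute (user, op, account, delta) steps and return the final balances."""
    balances = dict(balances)
    seen = {}  # seen[user][account] = value that user read earlier
    for user, op, account, delta in schedule:
        if op == "read":
            seen.setdefault(user, {})[account] = balances[account]
        else:
            # "write": store the value this user read earlier, adjusted by delta.
            balances[account] = seen[user][account] + delta
    return balances


initial = {"Account1": 1000, "Account2": 2000}

# The interleaving described in the text: A transfers $100, B transfers $50.
interleaved = [
    ("A", "read", "Account1", 0),
    ("A", "read", "Account2", 0),
    ("B", "read", "Account2", 0),
    ("B", "read", "Account1", 0),
    ("A", "write", "Account1", -100),
    ("B", "write", "Account2", -50),
    ("A", "write", "Account2", +100),
    ("B", "write", "Account1", +50),
]

# The same steps with all of A's invocations preceding all of B's.
serial = [s for s in interleaved if s[0] == "A"] + [s for s in interleaved if s[0] == "B"]

print(run(interleaved, initial))  # {'Account1': 1050, 'Account2': 2100}
print(run(serial, initial))       # {'Account1': 950, 'Account2': 2050}
```

The interleaved run leaves a total of $3150 in the two accounts, whereas the serial run preserves the original total of $3000.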

The system would not be inconsistent if all the invocations of user A had preceded
those of user B or vice versa. In general, a distributed system must contain
some synchronization mechanism using which a user can specify that a series of
operations be executed without being interrupted by the execution of other operations.
A user does this by means of transactions.

User A                                User B

[Account1.read(); ok($1000)]
[Account2.read(); ok($2000)]
                                      [Account2.read(); ok($2000)]
                                      [Account1.read(); ok($1000)]
[Account1.write($900); ok()]
                                      [Account2.write($1950); ok()]
[Account2.write($2100); ok()]
                                      [Account1.write($1050); ok()]

Figure 2.1. Bank account example

A  transaction  is  simply  a  sequence  of  invocations  that  a  user  wishes  to  have 
executed  as  a  unit.  When  a  transaction  is  executed  in  isolation  on  a  system  in  a 
consistent state, it is assumed to leave the system in a consistent state. Thus, in
the preceding example, the sequence of invocations performed by user A is a transaction,
as is the sequence performed by user B. The system may interleave the
execution of invocations from different transactions, but does so only if the effect
would be the same as if this interleaving did not occur. As a result, the execution
of any number of transactions on a system in a consistent state leaves it in a consistent
state.

The part of a distributed system that orders invocations with respect to one
another is modeled by the scheduler. Invocations resulting from ongoing transactions
are presented to the scheduler, which may accept them immediately or may
choose to defer their execution. The scheduler decides on an order in which to perform
invocations it accepts, and instructs objects to execute them in this order.
Objects  return  the  results  of  invocations  directly  to  the  user.  The  interaction 
between  the  components  of  the  model  is  shown  in  Figure  2.2. 

2.3.  Distributed  Objects 

We now discuss how the behavior of a distributed object may be specified. Let
$\mathit{DATA}_O$ be the data encapsulated by object O and $\mathit{OP}_O$ be the set of operations on
$\mathit{DATA}_O$. Let op denote an element of $\mathit{OP}_O$. An invocation of O at site s to perform
op, with the appropriate number of arguments of the correct type, is
represented as [O.op(); s]. When an object executes an operation and returns a
result, it is said to perform an action. An action is denoted [O.op(); res()], where
res() is the result returned. A serial computation is a sequence of actions taken by
an object, representing a series of invocations and their corresponding results. The
behavior of an object is defined by its specification $\mathit{SP}_O$, which is the set of all
correct serial computations.

[Figure 2.2. Interaction between the components of the model]

Let us illustrate these definitions by an example. Consider an object x of the
type integer discussed earlier. [x.write(21); ok()] is a possible action, denoting an
invocation to change the value of x to 21 and the result ok() returned by x.
[x.write(21); ok()] [x.write(15); ok()] [x.read(); ok(15)] is a possible serial computation
denoting three successive invocations and their corresponding results. Since
this serial computation reflects correct behavior for an object of type integer, it
would belong to the specification $\mathit{SP}_x$. On the other hand, the serial computation
[x.write(21); ok()] [x.read(); ok(15)] probably would not belong to the specification,
as it is not correct for the value 15 to be the result of a read immediately after the
value 21 is written.
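
One way to make membership in such a specification concrete is a checker that replays a serial computation against the intuitive semantics of an integer object. The sketch below is an assumption-laden illustration, not the representation used in this work: the function name `in_integer_spec`, the `(operation, argument, result)` encoding of actions, and the initial value 0 are all hypothetical.

```python
def in_integer_spec(computation, initial=0):
    """Return True if a serial computation is correct for an integer object.

    Each action is (operation, argument, result), with results encoded as
    (tag, value) pairs, e.g. ("ok", None) for ok() and ("ok", 15) for ok(15).
    """
    value = initial
    for op, arg, result in computation:
        if op == "write":
            if result != ("ok", None):
                return False
            value = arg  # a write replaces the stored value
        elif op == "read":
            if result != ("ok", value):
                return False  # a read must return the current value
        else:
            return False  # operation not provided by the integer type
    return True


good = [("write", 21, ("ok", None)), ("write", 15, ("ok", None)), ("read", None, ("ok", 15))]
bad = [("write", 21, ("ok", None)), ("read", None, ("ok", 15))]

print(in_integer_spec(good))  # True
print(in_integer_spec(bad))   # False
```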

In  this  work,  we  are  not  concerned  with  the  actual  representation  of  object 
specifications.  This  may  be  done  using  logical  predicates  as  in  [27,  32],  by  state 
machines  as  in  [46],  by  algebraic  specifications  as  in  [22],  or  by  any  other  means. 
For  the  examples  we  present,  we  rely  on  the  reader's  intuitive  understanding  of  the 
semantics of the operations discussed to decide whether a serial computation
belongs to a specification or not.


A specification $\mathit{SP}_O$ is said to be complete if for every allowable serial computation
and for every operation $op \in \mathit{OP}_O$, there is another serial computation that
extends the first one by an action [O.op(); res()], for any correct set of arguments
for op. In other words, after any sequence of invocations, a result is defined for the
next invocation, whatever the operation invoked may be. Formally, a specification
is said to be complete if the empty sequence $\phi \in \mathit{SP}_O$ and if for all $x \in \mathit{SP}_O$ and all
$op \in \mathit{OP}_O$, there exists a $y \in \mathit{SP}_O$ such that $y = x \cdot [O.op(); res()]$, where $\cdot$ is the
concatenation operator for sequences. We assume that all specifications are complete.
Since it is valid for an object to return the result error(), this is not a restriction.

A specification is prefix closed if for every $x \in \mathit{SP}_O$, every prefix of x also
belongs to $\mathit{SP}_O$. This excludes serial computations that cannot be performed one
invocation  at  a  time,  with  each  step  resulting  in  a  correct  computation.  We  assume 
that  all  specifications  are  prefix  closed. 
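
Prefix closure can be checked mechanically on a finite fragment of a specification. The following Python sketch assumes computations are encoded as tuples of actions and that only a finite set of them is examined (a real specification is generally infinite); the name `is_prefix_closed` is hypothetical.

```python
def is_prefix_closed(spec):
    """Check that every prefix of every computation in a finite spec is in it."""
    spec = set(spec)
    # c[:k] for k = 0..len(c) enumerates all prefixes, including the empty one.
    return all(c[:k] in spec for c in spec for k in range(len(c) + 1))


w = ("x.write(21)", "ok()")
r = ("x.read()", "ok(21)")

closed = {(), (w,), (w, r)}
not_closed = {(), (w, r)}  # the prefix (w,) is missing

print(is_prefix_closed(closed))      # True
print(is_prefix_closed(not_closed))  # False
```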

A deterministic object is one that always returns the same result for each invocation
when given the same sequence of invocations starting from the initial state.
Given a sequence of invocations, the specification of a deterministic object contains
only one serial computation in which the invocations occur in that order. If the
specification contains more than one such serial computation, the object is said to
be non-deterministic. When a non-deterministic object is given a sequence of invocations,
it may return results according to any one of the applicable serial computations.
Our results hold for both deterministic and non-deterministic objects.


Although the behavior of an object is specified in terms of serial computations,
it is not necessary that actual invocations of the object occur serially. Even if invocations
at the same site are ordered sequentially, the fact that the system is asynchronous
means that invocations from different sites are unordered relative to one
another, unless the scheduler explicitly orders one before the other. Thus, invocations
on an object are, in general, only partially ordered. Under these conditions,
we say that a distributed object behaves according to its specification if it returns
results according to some serial computation in which the order of invocations is
consistent with the partial order on actual invocations. The particular serial computation
chosen would depend on how the object is implemented.

Let PrintQueue be an object of type queue, with the following invocations:
[PrintQueue.insert(WarAndPeace); s], [PrintQueue.insert(RomeoAndJuliet); t],
[PrintQueue.ListQueue(); s], and [PrintQueue.ListQueue(); t]. Let the scheduler
order the invocations as in Figure 2.3. Each of the ListQueue invocations is
ordered after both the insert invocations, but the insert invocations are not
ordered relative to each other. Both the following serial computations would then
be consistent with the partial order on invocations:

(1) [PrintQueue.insert(WarAndPeace); ok()]
    [PrintQueue.insert(RomeoAndJuliet); ok()]
    [PrintQueue.ListQueue(); ok(WarAndPeace, RomeoAndJuliet)]
    [PrintQueue.ListQueue(); ok(WarAndPeace, RomeoAndJuliet)]

(2) [PrintQueue.insert(RomeoAndJuliet); ok()]
    [PrintQueue.insert(WarAndPeace); ok()]
    [PrintQueue.ListQueue(); ok(RomeoAndJuliet, WarAndPeace)]
    [PrintQueue.ListQueue(); ok(RomeoAndJuliet, WarAndPeace)]

[PrintQueue.insert(WarAndPeace); s]    [PrintQueue.insert(RomeoAndJuliet); t]

[PrintQueue.ListQueue(); s]            [PrintQueue.ListQueue(); t]

(each insert invocation is ordered before each ListQueue invocation)

Figure 2.3. A partial order on invocations

Hence, either result for the ListQueue invocations would be correct. The
actual result would depend on how the object is implemented and on factors like
message delays, etc. On the other hand, if the scheduler orders the insert invocation
at t after the insert at s (Figure 2.4), then only (1) would be consistent with
the partial order, and only one result would be possible.
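
The PrintQueue example can be reproduced by enumerating every total order of the four invocations that respects a given partial order and collecting the ListQueue results each one produces. The Python sketch below is illustrative only; the helper names, the `(operation, argument, site)` encoding of invocations, and the brute-force enumeration are assumptions made for the example.

```python
from itertools import permutations


def consistent_orders(invocations, before):
    """Yield every total order of invocations that respects the pairs in `before`."""
    for order in permutations(invocations):
        pos = {inv: k for k, inv in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in before):
            yield order


def list_results(order):
    """Return the distinct ListQueue results a queue gives under this order."""
    queue, results = [], set()
    for op, arg, _site in order:
        if op == "insert":
            queue.append(arg)
        else:  # "ListQueue"
            results.add(tuple(queue))
    return results


def possible_results(invocations, before):
    """Union of ListQueue results over all consistent total orders."""
    out = set()
    for order in consistent_orders(invocations, before):
        out |= list_results(order)
    return out


ins_s = ("insert", "WarAndPeace", "s")
ins_t = ("insert", "RomeoAndJuliet", "t")
list_s = ("ListQueue", None, "s")
list_t = ("ListQueue", None, "t")
invocations = [ins_s, ins_t, list_s, list_t]

# Both inserts precede both ListQueue invocations (as in Figure 2.3).
fig_2_3 = {(ins_s, list_s), (ins_s, list_t), (ins_t, list_s), (ins_t, list_t)}
# Additionally, the insert at s precedes the insert at t (as in Figure 2.4).
fig_2_4 = fig_2_3 | {(ins_s, ins_t)}

print(possible_results(invocations, fig_2_3))  # two possible listings
print(possible_results(invocations, fig_2_4))  # {('WarAndPeace', 'RomeoAndJuliet')}
```

Under the first partial order both listings are possible; once the insert at s is ordered before the insert at t, only the listing with WarAndPeace first remains.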

[PrintQueue.insert(WarAndPeace); s]

[PrintQueue.insert(RomeoAndJuliet); t]

[PrintQueue.ListQueue(); s]    [PrintQueue.ListQueue(); t]

Figure 2.4. Another partial order on invocations

The (partial) order in which invocations are performed on a distributed object
O is modeled by its history $H_O = [I_O, \rightarrow_O]$, where $I_O$ is a set of invocations and $\rightarrow_O$
is a partial order on them. (We use the notation $i \rightarrow_O j$ to mean $(i, j) \in \rightarrow_O$.) As
described above, O provides results for its invocations based on some sequential
computation in which the invocations occur in an order consistent with the partial
order $\rightarrow_O$. This is modeled by the development. The development $D_O$ of an object O
is a total order on the invocations in $I_O$; that is, it is a sequence containing all the
invocations in $I_O$ (and no others). We say that $D_O$ is consistent with $H_O$ if for all
pairs of invocations i, j such that $i \rightarrow_O j$, it is true that i occurs before j in $D_O$.
The set of actions formed by pairing each invocation in $D_O$ with the corresponding
result returned by O is called the response of O and is denoted $R_O$. Note that we
do not explicitly model the state of an object. This is not necessary, since the state
can be deduced from the values of $H_O$, $D_O$ and $R_O$.
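
The two conditions on a development (it contains exactly the invocations of the history, and it respects the partial order) translate directly into code. The Python sketch below assumes a history is given as a set of invocations plus a set of ordered pairs; the function names `is_development` and `is_consistent` are hypothetical.

```python
def is_development(development, invocations):
    """True if the sequence contains every invocation exactly once and no others."""
    return (len(development) == len(set(development))
            and set(development) == set(invocations))


def is_consistent(development, order_pairs):
    """True if, for every pair (i, j) in the partial order, i occurs before j."""
    pos = {inv: k for k, inv in enumerate(development)}
    return all(pos[i] < pos[j] for i, j in order_pairs)


invocations = {"i1", "i2", "i3"}
order_pairs = {("i1", "i3")}  # i1 is ordered before i3; i2 is unordered

print(is_consistent(["i1", "i2", "i3"], order_pairs))  # True
print(is_consistent(["i3", "i1", "i2"], order_pairs))  # False
```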

We now formalize what it means for an object to behave according to its (possibly
non-deterministic) specification, when the invocations may be only partially
ordered. First, its development $D_O$ must be consistent with its history $H_O$. A
second condition, described below, states that the response $R_O$ agrees with the
specification $\mathit{SP}_O$. Let $\mathit{LegalComps}_O(D_O)$ be the set of serial computations in $\mathit{SP}_O$
in which the invocations occur in the same order as in $D_O$. (If O is deterministic,
there is only one such computation and $|\mathit{LegalComps}_O(D_O)| = 1$.) For each computation
c in $\mathit{LegalComps}_O(D_O)$, let actions(c) be the set containing all the actions in
c; that is, the order on the invocations is overlooked. Let $\mathit{LegalResponses}_O(D_O)$ be
the set $\{\mathit{actions}(c) \mid c \in \mathit{LegalComps}_O(D_O)\}$. In other words, $\mathit{LegalResponses}_O(D_O)$ is
the set of responses allowed by the specification, if the development is $D_O$. (Again,
if O is deterministic, $|\mathit{LegalResponses}_O(D_O)| = 1$.) For O to behave according to its
specification, $R_O$ must belong to $\mathit{LegalResponses}_O(D_O)$.

We now present a definition that will be used later. Let last(H_O) be the set of
invocations in H_O that are not ordered before any other invocation. In other
words, last(H_O) is {i | i ∈ I_O and ∄ j: i →_O j}. We also extend the definition
of LegalResponses_O(D_O) to cover the case of a single invocation i in D_O. Let
LegalResponses_O(i, D_O) stand for the set of actions in LegalResponses_O(D_O)
that correspond to the invocation i. Formally, LegalResponses_O(i, D_O) = {a | a
is the action corresponding to i in R for some R ∈ LegalResponses_O(D_O)}.
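
As an illustration of these definitions, the following sketch computes last(H_O)
and the legal responses of a deterministic integer object. The encoding of
invocations as (name, operation, argument) tuples, the initial value 0, and the
result "ok" for a write are assumptions; only the ok(v) form of a read result is
taken from the examples in the text.

```python
def last(invocations, order):
    """Invocations that are not ordered before any other invocation."""
    has_successor = {i for i, _ in order}
    return {i for i in invocations if i not in has_successor}

def legal_responses(development, initial=0):
    """Legal responses of a deterministic integer object for one development.

    The object is deterministic, so the returned set has exactly one element:
    a frozenset of (invocation, result) actions with the order forgotten.
    """
    value, actions = initial, []
    for inv in development:
        _, op, arg = inv
        if op == "write":
            value = arg
            actions.append((inv, "ok"))
        else:  # read
            actions.append((inv, f"ok({value})"))
    return {frozenset(actions)}

def legal_responses_for(i, development):
    """Actions corresponding to invocation i in some legal response."""
    return {a for R in legal_responses(development) for a in R if a[0] == i}

w2 = ("w2", "write", 2)
w3 = ("w3", "write", 3)
r = ("r", "read", None)
D_O = [w2, w3, r]
final = last(D_O, {(w2, w3), (w3, r)})
read_actions = legal_responses_for(r, D_O)
```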

2.4.  Transactions 

A transaction T is modeled by a set I_T of invocations and a partial order →_T
on these invocations. The partial order reflects the data flow relationships
between invocations, and this order must be observed in any execution of the
transaction. We denote the set of all possible transactions by TRANS.

2.5. The scheduler

The function of the scheduler is to enforce an order between invocations from
different transactions in such a way that the transactions appear to have
executed independently of one another. We assume that the scheduler assigns each
transaction a unique identifier called a transaction-ID. The behavior of the
scheduler is modeled by a system history H_SYS = [E_SYS, →_SYS], where E_SYS is a
set of events and →_SYS is a partial order on these events. An event has the form
[Tid, O.op(), s] and represents the invocation [O.op(), s] issued by a
transaction with identifier Tid. The order →_SYS reflects the ordering decisions
made by the scheduler. For events resulting from the same transaction T, the
order →_SYS includes all the elements in →_T. In addition, →_SYS may contain
elements relating events from different transactions.

After the scheduler has ordered invocations, it passes them on to the relevant
objects for execution. The system history gives the ordering decisions made by
the scheduler, while an object history gives the order in which an object
receives invocations. It follows, then, that object histories can be deduced from
the system history. The history of an object O consists of all invocations in
E_SYS corresponding to O, ordered in the same way as in →_SYS. In other words,
H_O is the projection of H_SYS on O.
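
The projection can be sketched as follows. The encoding of an event as a
(Tid, object, operation, site) tuple is an assumption, as is the requirement that
the order be given as a transitively closed set of pairs and that distinct events
on an object yield distinct invocations.

```python
def project(events, order, obj):
    """Project a system history [E_SYS, ->_SYS] onto one object.

    Events are (tid, obj, op, site) tuples. The object history keeps the
    invocations (obj, op, site) on `obj` and the order that holds between them.
    """
    keep = [e for e in events if e[1] == obj]
    invocations = [e[1:] for e in keep]
    sub_order = {(a[1:], b[1:]) for a, b in order if a in keep and b in keep}
    return invocations, sub_order

e1 = ("T1", "x", "write(2)", "s")
e2 = ("T2", "x", "read()", "t")
e3 = ("T1", "y", "write(5)", "s")
arrow_SYS = {(e1, e2), (e1, e3)}

I_x, arrow_x = project([e1, e2, e3], arrow_SYS, "x")
```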

Finally, observe that if a global real time clock were available in the system,
the events in the system history could be totally ordered according to it. It
follows that each event e in E_SYS can be assigned a unique label time(e) that
can be used to place the events in this total order. This labeling may not be
known to the scheduler or to any other component of the system; we merely use the
fact that this labeling must exist. Given an event e, the set before(e) is
defined as the set

{e' | time(e') < time(e)}.

2.6.  Order  in  an  asynchronous  system 

We have said that the scheduler "orders" events. In this section, we discuss what
it means for the scheduler to order one event after another. Recall that in an
asynchronous system there is no global clock and any synchronization between
events at different sites must be based on messages sent between them. While two
events at the same site can be ordered relative to each other in the normal way,
the only way for a scheduler to ensure that an event e_2 at a site t is ordered
after an event e_1 at a site s is to cause a message to be sent from site s to
site t (perhaps indirectly via other sites) carrying the information that e_1 has
occurred and e_2 may proceed. This problem is discussed in detail in [31]. We
call the messages used by the scheduler to order events relative to one another
synchronization messages.

We now formalize this notion. Let sender(m) refer to the site from which a
synchronization message m is sent and let receiver(m) be its destination. A
message path exists from an event e_1 at site s to an event e_2 at site t either
if s = t and e_2 occurs after e_1 at this site, or if there is a sequence of
synchronization messages m_1, m_2, ..., m_n for which the following conditions
hold.

1. sender(m_1) = s and m_1 is sent from s after e_1 occurs there.

2. sender(m_i) = receiver(m_{i-1}) for 1 < i ≤ n, and m_i is sent after m_{i-1}
is received.

3. receiver(m_n) = t and e_2 occurs after m_n is received at t.

It follows from the previous discussion that two events in an asynchronous system
can be ordered relative to each other only if there is a message path between
them. In particular, if e_1 and e_2 belong to E_SYS and e_1 →_SYS e_2, then there
is a message path from e_1 to e_2. We use this fact in Chapter 5.
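
The three conditions above can be checked mechanically. The sketch below is only
an illustration: it assumes that every event and every send or receive is stamped
with a local time at its own site, and that times are compared only within a
single site, so no global clock is needed. The function name and tuple layout are
hypothetical.

```python
def message_path_exists(e1, e2, messages):
    """Check for a message path from e1 to e2.

    e1 and e2 are (site, local_time) pairs; each message is a tuple
    (sender, sent_at, receiver, received_at) using the local times of the
    sending and receiving sites respectively.
    """
    (s, t1), (t, t2) = e1, e2
    if s == t and t2 > t1:
        return True
    # Earliest local time at which each site can have learned that e1 occurred.
    earliest = {s: t1}
    changed = True
    while changed:
        changed = False
        for sender, sent_at, receiver, received_at in messages:
            # A message extends the path only if it is sent after the news arrived.
            if sender in earliest and sent_at > earliest[sender]:
                if received_at < earliest.get(receiver, float("inf")):
                    earliest[receiver] = received_at
                    changed = True
    return t in earliest and t2 > earliest[t]

# Site s sends to u after local time 4; u forwards to t after receiving it.
msgs = [("s", 5, "u", 2), ("u", 3, "t", 7)]
```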


CHAPTER  3 


The  scheduler 


3.1.  Introduction 

The scheduler orders invocations in such a way that the effects of the resulting
execution are indistinguishable from one in which the transactions are executed
one after another in some serial order. A history resulting from ordering
invocations in this way is said to be serializable, and is called an S-history. A
scheduler that produces only S-histories is called an S-scheduler.
Serializability in database systems has been studied in [3, 4, 5, 12, 18, 37, 38,
43], among others. In [25, 46], the notion of serializability is generalized to
abstract data types.

An S-scheduler could operate by fixing an order on transactions and ordering
every invocation of a transaction after all invocations of transactions ordered
before it. The resulting histories would be serializable, but the disadvantage of
such a scheduler is clear. Only one transaction can be operational in the system
at any time: every transaction must wait until all the invocations of the
previous one have completed before it can proceed. This is often an unnecessary
restriction. Consider, for example, a transaction that invokes an integer object
TemperatureInIthaca to perform write(23) and a second transaction that performs
write(82) on another object TemperatureInPaloAlto. There is no need for either
transaction to wait for the other to complete, since they operate on different
objects and the results of their invocations would be the same even if they were
executed concurrently. Because the scheduler described above places an order
between such invocations, it could cause unnecessary delays in processing
transactions. It also takes no advantage of the replicated processing power
available in a distributed system. In general, it is desirable for a scheduler to
permit as many invocations as possible to execute concurrently, while still
maintaining serializability.

The level of concurrency permitted by a scheduler can be measured by studying the
class of histories it generates. This problem has been studied in the context of
databases, where the operations are read and write on single-valued data items.
Kung and Papadimitriou measure the performance of a scheduler by its fixpoint
set, which corresponds to the set of histories allowed by the scheduler [29]. The
larger this set, the more concurrency the scheduler allows. The properties of
different sub-classes of serializable histories (e.g. DSR, VSR and CPSR) have
been studied in [23, 37, 38, 48]. In this chapter, we generalize these ideas to
the case of arbitrary data types, with general operations on the data. We define
the notion of conflicting invocations in this setting, and introduce a class
called C-histories, which is a generalization of the classes DSR or CPSR in
database theory.


3.2.  When  can  invocations  be  left  unordered? 


Invocations have to be ordered relative to each other when their effects, as
perceived by a user, could depend on the order in which they are executed. When a
user cannot distinguish the relative order of a pair of invocations based on
their results or the results of future invocations, the invocations can be
executed in parallel. It is not incorrect for a scheduler to order such
invocations, but this lowers the level of concurrency in the system.


A simple case where it is unnecessary to order invocations is when they access
different objects. The result of one invocation is independent of the other and
hence of the order in which they are carried out. Moreover, the data of each
object are left in the same state regardless of the order in which the
invocations are performed. Consequently, the result of no future invocation on
either object will depend on their order. The order is hence indistinguishable to
a user.

Sometimes it is not necessary to order operations even when they access the same
object. Consider two transactions, each performing a read operation on the object
LottoWinner of type integer. The result returned for each invocation is the same,
regardless of the order in which the reads occur. Moreover, LottoWinner remains
in the same state in either case, so the result of no subsequent invocation on
LottoWinner could depend on this order. This is an example where invocations need
not be ordered relative to each other because of the semantics of the operations
in question (in this case, the semantics of read).

Consider an object x of type number, which provides an operation multiply.
Invoking multiply(v) sets the value of x to the value obtained by multiplying the
current value of x by v. The result returned is the new value of x. Now, if one
transaction performs x.multiply(2) and another one performs x.multiply(3), the
results would, in general, differ depending on the order in which these
invocations are carried out. However, if the current value of x was 0, the order
in which the invocations took place would be indistinguishable to a user. If the
scheduler knew the value of x, it could leave the invocations unordered. Thus,
knowledge of the current state of an object can be used to avoid ordering
operations.
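
The multiply example can be worked through directly. The following sketch assumes
a helper named run and a test named order_is_visible, both hypothetical; it
compares the results seen by the two transactions, and the final value of x,
under the two possible orders.

```python
def run(initial, multipliers):
    """Apply multiply(v) invocations in sequence; return (results, final value)."""
    x, results = initial, []
    for v in multipliers:
        x *= v
        results.append(x)  # each invocation returns the new value of x
    return results, x

def order_is_visible(initial):
    """True if a user could tell which of multiply(2), multiply(3) ran first."""
    (a2, a3), final_a = run(initial, [2, 3])  # multiply(2) first
    (b3, b2), final_b = run(initial, [3, 2])  # multiply(3) first
    # Compare the result each invocation receives, and the state left behind.
    return (a2, a3, final_a) != (b2, b3, final_b)
```

With a current value of 0, order_is_visible(0) is false and the invocations may
be left unordered; with any non-zero value the results differ.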


Again, if two transactions were each performing a write on an object of type
integer, their invocations would normally have to be ordered relative to each
other, because the result of a subsequent read operation could depend on the
order. However, if it were known that the values being written were the same, the
order of invocations would be immaterial. In this case, the arguments of the
invocations determine whether they can be left unordered.

Finally, consider two invocations of the add operation on an object of type
indexed_file, described earlier. If it is known that no transaction performs a
ListFile operation, then the invocations need not be ordered, as this is the only
operation whose result depends on the way in which add operations are ordered
relative to one another. Here, knowledge about the future behavior of
transactions can be used to avoid ordering invocations.

We see from these examples that knowledge of the semantics of operations, the
current state of objects, the arguments for a particular invocation, or the
future behavior of transactions can all be used by the scheduler to increase
concurrency by not ordering certain operations relative to others. A scheduler
that maintains and uses all this information would, unfortunately, be extremely
complex and inefficient. A practical scheduler uses some, but not all, of this
information and hence may order more operations than strictly necessary for
serializability. Examples of schedulers for database systems are given in [1, 12,
19, 29, 37, 42, 46]. Kung and Papadimitriou [29] model the level of knowledge
available to a database scheduler and present optimal schedulers for different
levels of knowledge.


Recall that the aim of this work is to develop an efficient method of managing
replicated data. In particular, we present an implementation in which the
information transmitted to keep data consistent is included in synchronization
messages that must be sent by the scheduler even if data are not replicated. To
be completely general, we allow for the case where the scheduler may send as few
messages as possible. We allow for the possibility that the scheduler may have a
high level of knowledge, which it uses to limit the number of invocations it
orders. In Section 2.6, we observed that a scheduler in an asynchronous system
can order invocations only by sending messages between sites. Hence, if the
scheduler orders fewer operations, it will send fewer synchronization messages.
We show that, even under these conditions, the messages that the scheduler must
send are sufficient for the implementation of replicated data to operate
correctly. A practical scheduler, operating with less knowledge, can only send
more synchronization messages. As a result, the implementation remains valid.

In terms of our model, we assume that the scheduler could have knowledge of all
object specifications (and hence of the semantics of all operations), the current
system history (which together with the object specifications yields information
about the current state of each object), the arguments for each invocation, and
the set TRANS of all possible transactions (which amounts to information about
the possible future behavior of transactions). On the other hand, the scheduler
can have no knowledge of the actual implementation of an object. The scheduler
bases its decisions on the fact that objects meet their specifications, but where
a specification can be satisfied in more than one way, the scheduler can make no
assumptions about which way an object chooses. In terms of the model, while the
scheduler may have knowledge of the history H_O of an object O (this is simply
the projection of H_SYS on O), all it knows about the development D_O is that it
is consistent with H_O. Further, it has no knowledge of the response R_O except
that it belongs to LegalResponses_O(D_O) for some D_O consistent with H_O.
However, if H_O and the specification SP_O are such that only one value of D_O or
R_O satisfies the conditions above, the scheduler may make use of this knowledge.

We  now  characterize  the  invocations  that  a  scheduler  must  order,  given  that  it 
has  the  kind  of  knowledge  discussed  above.  Any  practical  scheduler,  having  less 
knowledge,  orders  a  superset  of  these  invocations.  First,  we  formalize  the  notion  of 
serializability. 

3.3.  Serializability 

Let H_SYS be a system history and let H_O be the projection of H_SYS on object
O. Let D_O be any development consistent with H_O (not necessarily the actual
development of O) and let R_O be any response belonging to LegalResponses_O(D_O).
Let →_SER be a total order on the transaction-IDs in H_SYS. →_SER is called a
serialization order for H_SYS, and represents the order in which a user perceives
transactions to have executed (although their invocations may actually be
interleaved in H_SYS). H_SYS is serializable if the following condition holds.

Define Reorder(D_O, →_SER) as the development formed by placing the invocations
in D_O in the order that would have resulted had the transactions actually
executed in the order specified by →_SER. That is, if i_1 = [O.op_1(), s] in D_O
corresponds to event [Tid_1, O.op_1(), s] in H_SYS and i_2 = [O.op_2(), t] in D_O
corresponds to event [Tid_2, O.op_2(), t] in H_SYS, and if Tid_1 →_SER Tid_2,
then i_1 occurs before i_2 in Reorder(D_O, →_SER). H_SYS is serializable if the
response R_O is legal based on the development Reorder(D_O, →_SER). That is,
H_SYS is serializable if the response based on D_O is indistinguishable from one
based on a development that would have resulted had the transactions executed
sequentially.

Formally, H_SYS is serializable if ∃ →_SER: [∀ objects O and ∀ D_O consistent
with the projection of H_SYS on O: LegalResponses_O(Reorder(D_O, →_SER)) ⊆
LegalResponses_O(D_O)].

3.4.  Sub-classes  of  serializable  histories 

In [37], it is shown that the problem of recognizing whether a given database
history is serializable is NP-complete. It is also shown that a scheduler that
generates histories that span the complete set of S-histories must take time
exponential in the number of invocations (unless P = NP). An efficient scheduler
(one that takes time polynomial in the number of invocations) generates only a
subset of all possible S-histories. This sacrifices some of the concurrency
available in the system because certain executions that are actually serializable
are disallowed.

Different sub-classes of S-histories and their corresponding schedulers have been
studied in the context of databases [23, 37]. Papadimitriou [37] shows that the
class DSR encompasses the classes of histories generated by most known scheduling
disciplines like two-phase locking, timestamping, etc. The class DSR is based on
the notion of conflicting invocations. Two invocations in a database system
conflict if they both access the same object and one of them performs a write
operation. In a history belonging to the class DSR any two conflicting
invocations must be ordered relative to each other. We generalize this idea to
distributed objects, and call the corresponding class of histories C-histories. A
scheduler that generates C-histories is called a C-scheduler. We assume that the
scheduler in our system is a C-scheduler.

3.5.  C-histories 

An invocation is said to conflict with another if the order in which the
invocations are executed could make a difference to their results or to the result of some
future  invocation.  The  earlier  discussion  on  serializability  showed  that  whether  an 
invocation  conflicts  with  another  could  depend  on  the  semantics  of  the  operations  in 
question,  the  current  state  of  the  objects,  the  arguments  for  the  invocations  and  the 
future  behavior  of  transactions.  We  take  this  into  account  in  the  definition  of 
conflict  given  below.  In  a  C-history,  every  invocation  is  ordered  relative  to  any 
invocation  that  it  conflicts  with.  This  does  not  preclude  other  orderings,  but  merely 
stipulates  a  set  of  orderings  that  must  be  present  in  a  C-history. 

We now define conflict formally. Define the extension of a system history to be
any history that could result from the completion of ongoing transactions and/or
the execution of any new transactions from TRANS. Let e_2 = [Tid_2, O.op_2(), s]
be an event in H_SYS and let e_1 = [Tid_1, O.op_1(), t] be any other event in
before(e_2) that has a different transaction-ID, but invokes the same object. Let
H'_SYS be an extension of before(e_2) that contains e_2 and in which e_2 is not
ordered relative to e_1. Let H'_O be the projection of H'_SYS on O. Let D'_O be
any development consistent with H'_O of the form γ i_1 i_2 δ, where i_1 and i_2
are the invocations corresponding to e_1 and e_2 respectively, and γ and δ are
sequences of invocations. Then e_2 conflicts with e_1 if
LegalResponses_O(γ i_1 i_2 δ) ≠ LegalResponses_O(γ i_2 i_1 δ). In other words,
e_2 conflicts with e_1 if the result of i_1, i_2, or any future invocation could
differ depending on the order in which i_1 and i_2 are executed.
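
For a deterministic integer object, the comparison in this definition can be
sketched as follows. This is only an illustration under stated assumptions: the
(name, operation, argument) encoding, the initial value 0, and the "ok" result of
a write are hypothetical, and the sketch tests a single choice of γ and δ rather
than every extension permitted by TRANS.

```python
def response(development, initial=0):
    """Response of a deterministic integer object, as a set of actions."""
    value, actions = initial, set()
    for inv in development:
        _, op, arg = inv
        if op == "write":
            value = arg
            actions.add((inv, "ok"))
        else:  # read returns the current value
            actions.add((inv, value))
    return actions

def conflicts(i1, i2, gamma=(), delta=()):
    """Compare the responses for gamma i1 i2 delta and gamma i2 i1 delta."""
    first = response([*gamma, i1, i2, *delta])
    second = response([*gamma, i2, i1, *delta])
    return first != second

r1, r2 = ("r1", "read", None), ("r2", "read", None)
w5, w7 = ("w5", "write", 5), ("w7", "write", 7)
w5_again = ("w5_again", "write", 5)
later_read = ("r3", "read", None)
```

Two reads never conflict here, a write conflicts with a read, and two writes
conflict only when their arguments differ and some later invocation (δ) can
observe the difference, which matches the informal discussion in Section 3.2.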

The notions of conflict and C-histories as defined above are quite general. When
applied to a system of single-valued objects with only read and write operations,
they reduce to the corresponding definitions in database theory. The definition
of conflict takes into account the semantics of operations and the arguments for
each invocation by being defined in terms of LegalResponses_O, which in turn is
defined in terms of SP_O. SP_O encodes the semantics of all the operations of O.
The definition also accounts for the current state of objects because only
extensions of before(e) are considered. Knowledge of the future behavior of
transactions is included because extensions can be formed only by including
transactions from the set TRANS. Since the only orderings that must necessarily
be present in a C-history are those between conflicting invocations, our
definition of C-schedulers includes schedulers that may use high levels of
knowledge to avoid ordering invocations. Hence, the restriction to C-schedulers
is not a major one.


CHAPTER  4 


Schemas 


4.1.  Introduction 

The  specification  of  a  distributed  object  describes  its  behavior  from  the  point  of 
view  of  external  effects.  In  this  chapter,  we  consider  the  internal  implementation 
of  a  distributed  object;  that  is,  how  invocations  at  different  sites  are  coordinated  to 
provide  the  behavior  required  at  the  external  interface  of  the  object.  There  are 
several ways in which an object can be implemented, while still meeting its
external specification. We first describe two possibilities: a centralized implementation
and  a  replicated  implementation.  We  then  discuss  why  they  are  inefficient  and  lay 
the  groundwork  for  a  more  efficient  implementation. 

4.2.  Two  possible  implementations 

The centralized implementation is similar to the method described in [44]. One of
the sites where the object is accessible is chosen as the "master," while the
other sites are "slaves." All invocations are executed sequentially by the
master. Invocations scheduled at the slaves are passed on to the master for
execution. The results of such invocations are sent from the master back to the
slaves, which give the result to the user. Such a centralized implementation
makes sense if the slaves are sites with little or no processing capacity. This
is true, for example, with bank teller machines connected to a central computer.

If, on the other hand, all sites have comparable processing capacity (and the
results in this work are primarily aimed at such systems), a replicated
implementation is possible. Here, a copy of the object data and the definitions
of the operations are placed at each site where the object is accessible. When an
invocation is scheduled for execution, the site at which it is scheduled (i.e.
the local site) informs the other sites where the object is accessible (the
remote sites) of the execution. All the sites execute the operation in question,
each using its own copy of data.¹ The local site returns the result to the user.
This implementation requires some mechanism to ensure that invocations are
executed in the same order at all sites, otherwise a copy of the data could
become inconsistent with other copies. Because of such inconsistencies, a site
may return results that are not permitted by the specification. This is described
in the following example.

Consider an integer object x such that Accessible_x = {s, t}. Assume that the
following totally ordered sequence of invocations occurs: [x.write(2), s]
[x.write(3), s] [x.read(), s] [x.read(), t]. If x behaves according to its
specification, the result for both the read invocations should be ok(3), because
the invocation write(3) followed the invocation write(2). Assume that the write
operations on the copy of x at site s occur in the order above, but the writes
are erroneously performed in the opposite order at site t, that is, write(3)
occurs before write(2). When the invocation [x.read(), t] is performed on the
copy at t, the result will be ok(2), which is incorrect.
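
The inconsistency in this example can be reproduced with two independent copies.
The class IntegerCopy and the "ok" result of a write are assumptions made for the
sketch; the ok(v) form of the read result follows the example.

```python
class IntegerCopy:
    """One site's copy of the integer object x (hypothetical helper)."""

    def __init__(self):
        self.value = 0

    def write(self, v):
        self.value = v
        return "ok"

    def read(self):
        return f"ok({self.value})"

copy_s, copy_t = IntegerCopy(), IntegerCopy()

# Site s applies the writes in the scheduled order.
for v in (2, 3):
    copy_s.write(v)

# Site t erroneously applies the same writes in the opposite order.
for v in (3, 2):
    copy_t.write(v)

result_s = copy_s.read()
result_t = copy_t.read()
```

Here result_s is ok(3), as the specification requires, while result_t is ok(2).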

¹An optimization is possible in the case of operations that do not change the
state  of  an  object,  e.g.  a  read  operation  on  an  integer.  These  can  be  performed  at 
any  one  site  without  informing  the  others  of  it. 


One way to avoid this problem in a replicated implementation is to acquire a
"lock" at all sites before any operation is executed. A lock acquired at a site
is released when the execution is completed at that site. A lock cannot be
acquired at a site if it is already acquired there for another invocation; the
later invocation is delayed until the earlier one completes and the lock is
released. This scheme ensures that invocations are performed in the same order at
every site. However, deadlock may occur if an attempt is made to acquire locks
for more than one invocation simultaneously: one invocation may acquire the lock
at some sites, while another invocation acquires it at the other sites. The
scheme can be modified to use a deadlock-free protocol to acquire locks, or to
detect when deadlock occurs and take corrective action.

The  advantage  of  a  replicated  implementation  is  that  even  if  one  or  more  of 
the  sites  at  which  an  object  is  accessible  fail,  the  operational  sites  can  continue  to 
accept  and  process  invocations  because  an  up-to-date  copy  of  the  object’s  data  is 
available at the operational sites. This is not true of the centralized
implementation, where if the master fails, the object cannot process invocations scheduled at
the  slaves. 

The disadvantage of both the centralized and the replicated implementations given
is that they are essentially synchronous. Both execute operations sequentially:
the centralized implementation does so at the master, while in the replicated
implementation all sites operate in tandem. A synchronous implementation has the
following drawbacks. Every invocation requires communication between sites. This
increases the number of messages being sent in the system. This message
transmission also introduces latency when operations are performed, because a
user invoking an object must wait for a message to be sent between sites before
receiving a result (Section 1.2).

Clearly,  it  is  desirable  to  reduce  the  number  of  messages  sent,  to  cut  down 
latency and to perform operations in parallel when possible. To do so, it is
necessary to decouple the execution of an operation at the local site from its execution at
remote  sites.  The  local  site  can  then  execute  an  operation  and  return  its  result 
without  waiting  for  remote  sites.  This  eliminates  latency.  In  [17,  45],  this  issue  is 
discussed  in  the  context  of  database  systems.  Decoupling  remote  execution  also 
means  that  while  an  operation  is  being  executed  at  a  local  site,  other  operations 
can  be  executing  at  the  remote  sites,  thus  taking  advantage  of  the  parallelism  in 
the  system.  Of  course,  this  decoupling  must  be  done  in  a  way  that  does  not  result 
in  inconsistency  as  exhibited  in  our  earlier  example.  In  Chapter  5,  we  present  two 
implementations that achieve these aims. We now model the internal
implementation of a distributed object by means of schemas and derive a result that will be
used  as  a  basis  of  those  implementations. 

4.3.  Schemas 

Except in the most trivial objects (e.g. one whose operations always return the
same result), it is not possible to totally decouple events at one site from
those at others. To provide a correct response, a local site must have
information about invocations that have been scheduled at other sites. However,
it is not always necessary to have knowledge of all such invocations. For
example, consider an object of type integer. A local site can provide a correct
result for a read invocation if it has knowledge of all write invocations; it
does not need information about other read invocations. In general, a site in a
non-synchronous implementation responds to an invocation based on partial
information about the history of an object.

We  call  the  partial  history  upon  which  an  implementation  bases  its  response 
to  an  invocation  i  the  perspective  for  i.  In  both  the  centralized  and  the  replicated 
implementations described earlier, for example, the perspective for any invocation i
is  the  entire  history  of  the  object.  For  an  integer  object  implemented  as  described 
above,  the  perspective  for  any  read  invocation  is  the  part  of  the  history  containing 
all  the  write  invocations.  In  this  work,  we  are  not  concerned  with  how  a  particular 
implementation  may  be  expressed  or  described.  We  observe  only  that  given  an 
implementation,  it  is  always  possible  to  define  the  perspective  for  any  invocation  i, 
even  if  only  by  exhaustive  enumeration.  Thus  we  can  model  an  implementation  in 
terms  of  perspectives.  We  formalize  these  ideas  below. 

An  implementation  of  a  distributed  object  is  modeled  by  a  schema.  A  schema 
is a function that, when given an object history $H_O = [I_O, \rightarrow_O]$ and an invocation
$i \in last(H_O)$, gives an i-history. An i-history is a history $H^i_O = [I^i_O, \rightarrow^i_O]$,
where $I^i_O$ is a subset of $I_O$ that contains i. The partial order $\rightarrow^i_O$ is formed by
taking all the elements in $\rightarrow_O$ that are relations between invocations in $I^i_O$. In
other words, $\rightarrow^i_O = \{(j, k) \mid j, k \in I^i_O \text{ and } (j, k) \in \rightarrow_O\}$. The i-history given
by a schema defines the perspective for an invocation i in the corresponding implementation.
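The restriction that turns a history into an i-history can be sketched in a few lines. This is only an illustration: the set-of-pairs encoding of the partial order, the function name `i_history`, and the invocation names are assumptions made here, not notation from the text.

```python
from typing import Set, Tuple

# A partial order is modeled as a set of pairs (j, k), read "j is ordered before k".
Order = Set[Tuple[str, str]]

def i_history(invocations: Set[str], order: Order, subset: Set[str], i: str):
    """Return (subset, induced order) for a chosen subset of the history.

    The subset must lie inside the history and must contain i.
    """
    if i not in subset or not subset <= invocations:
        raise ValueError("subset must be a subset of the history that contains i")
    # Keep exactly those ordered pairs whose two ends both lie in the subset.
    induced = {(j, k) for (j, k) in order if j in subset and k in subset}
    return subset, induced

# Hypothetical history of an integer object: two writes and two reads, totally ordered.
I_O = {"w1", "r1", "w2", "r2"}
arrow_O = {("w1", "r1"), ("w1", "w2"), ("r1", "w2"),
           ("w2", "r2"), ("w1", "r2"), ("r1", "r2")}

# Perspective for the read r2: all writes plus r2 itself.
sub, sub_order = i_history(I_O, arrow_O, {"w1", "w2", "r2"}, "r2")
```

Here the read `r1` drops out of the perspective for `r2`, and every surviving pair keeps the direction it had in the full history.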


4.3.1.  Correctness  of  a  schema 

We  now  discuss  what  it  means  for  a  schema  S  to  be  correct.  For  all  histories 
and for all invocations i, the result returned based on the i-history specified by S
should be the same as one that would be returned based on the entire history. Formally,
let $H_O$ be any history for object O, and let i be any invocation in $last(H_O)$.
Let $D_O$ be any development consistent with $H_O$. Let $H^i_O$ be the i-history defined by
schema S and let $D^i_O$ be any development consistent with $H^i_O$. S is a correct
schema if $\mathit{LegalResponses}_O(D^i_O) \subseteq \mathit{LegalResponses}_O(D_O)$.

4.3.2.  Correctness  with  respect  to  a  class  of  system  histories 

We have defined what it means for a schema to be correct with respect to all
possible histories $H_O$. However, if it is known that all system histories belong to a
particular class and that all object histories are projections of histories from this
class, then a schema need be correct only with respect to such histories. It is often
possible to optimize object implementations in such a way that they are efficient for
histories from a particular class, but may be inefficient or even incorrect when
histories are not from this class. In particular, we are interested in implementations
that are efficient with respect to projections of C-histories. It is straightforward to
extend the definition of correctness of a schema to give the definition of correctness
with respect to a particular class A of system histories. We merely replace "let $H_O$
be any history for object O" in the definition of correctness by "let $H_O$ be any
history that is a projection of an A-history on object O."

4.3.3. C-schemas

Let $H_O = [I_O, \rightarrow_O]$ be a projection of a C-history on O and let $i = [O.op(), s]$
belong to $last(H_O)$. A C-schema is one in which the i-history $H^i_O = [I^i_O, \rightarrow^i_O]$
satisfies the following conditions.

1. $i \in I^i_O$.

2. If $j \in I^i_O$ and $k \rightarrow_O j$, then $k \in I^i_O$ and $k \rightarrow^i_O j$.

In other words, the i-history specified by a C-schema contains all invocations
on O that the scheduler orders before i, and maybe other invocations. If the
i-history contains invocations that are not ordered relative to i, then it also contains
all  invocations  ordered  before  these  invocations.  We  show  below  that  C-schemas 
are  correct  with  respect  to  C-histories.  A  C-schema  will  be  used  in  Chapter  5  as 
the  basis  of  an  efficient  implementation  of  distributed  objects. 
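The two C-schema conditions amount to requiring that the chosen invocation set contain i and be closed under predecessors. A minimal sketch of that check follows, again assuming a set-of-pairs encoding of the order; the function name and the sample invocations are hypothetical.

```python
from typing import Set, Tuple

def satisfies_c_schema(order: Set[Tuple[str, str]], subset: Set[str], i: str) -> bool:
    """Check conditions 1 and 2 for a candidate invocation set.

    `order` holds pairs (k, j) meaning k is ordered before j in the full history.
    The second half of condition 2 (k is also ordered before j in the i-history)
    holds automatically when the i-history order is the induced one.
    """
    if i not in subset:                      # condition 1
        return False
    # Condition 2: every predecessor of a member must itself be a member.
    return all(k in subset for (k, j) in order if j in subset)

# Hypothetical history: a -> b -> i, and an unrelated chain c -> d.
order = {("a", "b"), ("b", "i"), ("a", "i"), ("c", "d")}

ok_minimal = satisfies_c_schema(order, {"a", "b", "i"}, "i")
ok_extended = satisfies_c_schema(order, {"a", "b", "i", "c", "d"}, "i")
missing_pred_of_i = satisfies_c_schema(order, {"b", "i"}, "i")
missing_pred_of_d = satisfies_c_schema(order, {"a", "b", "i", "d"}, "i")
```

The last case illustrates the second sentence of the paragraph above: once the unordered invocation `d` is included, its predecessor `c` must be included as well.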

4.3.4.  Correctness  of  a  C-schema  with  respect  to  C-histories 

We now show that if an object responds to each invocation i based on an
i-history specified by a C-schema, the results returned are consistent with a response
based on the entire object history, given that the system history is a C-history.

Let $H_O$ be a projection of a C-history on O and let $i = [O.op(), s]$ belong to
$last(H_O)$. Let $D_O$ be any development consistent with $H_O$. Let $H^i_O$ be an i-history
resulting from a C-schema, and let $D^i_O$ be any development consistent with $H^i_O$. To
prove that a C-schema is correct with respect to C-histories, we must show that

$\mathit{LegalResponses}_O(i, D^i_O) \subseteq \mathit{LegalResponses}_O(i, D_O)$.


The proof is as follows. We construct a sequence of developments $D_O = D^0$,
$D^1$, $D^2$, ..., $D^{last}$ such that $D^i_O$ is a prefix of $D^{last}$. Each development $D^k$ is
consistent with $H_O$. For all k > 0, $D^k$ is obtained from $D^{k-1}$ by switching the order of
two invocations, but maintaining the property that
$\mathit{LegalResponses}_O(D^k) = \mathit{LegalResponses}_O(D^{k-1})$.
Hence, $\mathit{LegalResponses}_O(D^0) = \mathit{LegalResponses}_O(D^{last})$.
Since $D_O = D^0$, $\mathit{LegalResponses}_O(D_O) = \mathit{LegalResponses}_O(D^{last})$, and hence
$\mathit{LegalResponses}_O(i, D^0) = \mathit{LegalResponses}_O(i, D^{last})$. Because $D^i_O$ is a prefix
of $D^{last}$ and specifications are complete and prefix closed, it follows that

$\mathit{LegalResponses}_O(i, D_O) = \mathit{LegalResponses}_O(i, D^i_O)$.

The algorithm used to construct the sequence of developments is given below.
A proof that the algorithm terminates and a proof of its correctness follow. We use
$\alpha_1 \alpha_2 \ldots \alpha_m$ to represent $D^i_O$ and $\beta_1 \beta_2 \ldots \beta_n$ to represent $D^k$ for the
current value of k (each $\alpha_j$ or $\beta_j$ represents an invocation on O). Note that
$n \geq m$ because all $D^k$ are consistent with $H_O$, and $H_O$ contains all the invocations in $H^i_O$.

4.3.4.1. The algorithm

(1) Let k = 0 and $D^k = D_O$.

(2) Let p be the largest integer $0 \leq p \leq m$ such that $\alpha_1 \ldots \alpha_p = \beta_1 \ldots \beta_p$. In
other words, p is the length of the longest common prefix of $D^i_O$ and $D^k$.
If p = m, $D^i_O$ is a prefix of $D^k$ and the algorithm terminates.

(3) If $p \neq m$, consider the invocation $\alpha_{p+1}$.

$\alpha_{p+1}$ must occur in $D^k$ because $D^k$ contains all the invocations in $I_O$, and $I_O$
includes all the invocations in $I^i_O$ and hence in $D^i_O$.

Let $\beta_l$ be the same invocation as $\alpha_{p+1}$.

Note that l > p + 1, otherwise the longest common prefix could have been
extended to p + 1.

(4) The next development $D^{k+1}$ is $\beta_1 \ldots \beta_{l-2}\, \beta_l\, \beta_{l-1}\, \beta_{l+1} \ldots \beta_n$; that is, $D^{k+1}$ is
the same as $D^k$ but with the positions of $\beta_{l-1}$ and $\beta_l$ interchanged. We show
below that $\mathit{LegalResponses}_O(D^{k+1}) = \mathit{LegalResponses}_O(D^k)$.

(5) Set k to the value of k + 1 and go to step (2).
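The swapping procedure can be traced mechanically. The sketch below represents developments as Python lists of distinct invocation names and assumes that every invocation of the shorter development also occurs in the longer one; the function name `make_prefix` and the sample data are illustrative. It performs only the reordering and does not check the unorderedness or LegalResponses properties, which are what the proofs establish.

```python
from typing import List

def make_prefix(d_full: List[str], d_partial: List[str]) -> List[List[str]]:
    """Return the developments D^0, D^1, ..., D^last produced by adjacent swaps.

    d_full plays the role of D_O and d_partial the role of the i-history
    development; on return, d_partial is a prefix of the last list.
    """
    current = list(d_full)                    # step 1: D^0 = D_O
    steps = [list(current)]
    m = len(d_partial)
    while True:
        # Step 2: p is the length of the longest common prefix.
        p = 0
        while p < m and current[p] == d_partial[p]:
            p += 1
        if p == m:
            return steps
        # Step 3: find the invocation alpha_{p+1} in the current development.
        # With 0-based indexing its position l satisfies l >= p + 1.
        l = current.index(d_partial[p])
        # Step 4: interchange it with its left neighbour to obtain D^{k+1}.
        current[l - 1], current[l] = current[l], current[l - 1]
        steps.append(list(current))           # step 5: k := k + 1

d_full = ["a", "x", "b", "y", "c"]
d_partial = ["a", "b", "c"]
steps = make_prefix(d_full, d_partial)
```

In this run `b` needs one swap and `c` needs two, so the sequence contains four developments, the last of which begins with `a`, `b`, `c`.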

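Figure 4.1 shows one step of the algorithm with k = 0 and p = 3.

$D_O = D^0 = \beta_1\ \beta_2\ \beta_3\ \ldots\ \beta_{l-1}\ \beta_l\ \ldots$

$D^1 = \beta_1\ \beta_2\ \beta_3\ \ldots\ \beta_l\ \beta_{l-1}\ \ldots$

$D^{last} = \alpha_1\ \alpha_2\ \alpha_3\ \alpha_4\ \ldots$

$D^i_O = \alpha_1\ \alpha_2\ \alpha_3\ \alpha_4$

($\beta_l$ and $\alpha_4$ are the same invocation.)

Figure 4.1. A step in the algorithm

4.3.4.2. Proof of termination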

Each iteration moves the position of $\beta_l$ (corresponding to $\alpha_{p+1}$ in $D^i_O$) one
position to the left in $D^k$ to give $D^{k+1}$. Hence after l - (p + 1) iterations, this
invocation will reach position p + 1 in $D^k$. The next iteration will increase p, the length
of the longest common prefix between $D^i_O$ and $D^k$, to at least p + 1 and repeat the
process. Thus the value of p continues to increase as the algorithm progresses.
Eventually its value will become equal to m and the algorithm will terminate.

4.3.4.3. Proof that all developments $D^k$ are consistent with $H_O$

This is proved by induction on k. The base case, k = 0, follows immediately
since $D^0 = D_O$, which is consistent with $H_O$.

For the inductive step, assume that $D^k$ is consistent with $H_O$. Since
l > p + 1 (see note in Step 3), the invocation $\beta_{l-1}$ does not occur in $\beta_1 \ldots \beta_p$, and
because $\alpha_1 \ldots \alpha_p = \beta_1 \ldots \beta_p$, it is not one of the first p invocations in $D^i_O$ either.
The next invocation in $D^i_O$, $\alpha_{p+1}$, is the same as $\beta_l$ (Step 3). Hence the invocation
$\beta_{l-1}$ must occur after $\alpha_{p+1}$ (viz. $\beta_l$) in $D^i_O$ or not occur in $D^i_O$ at all. We consider
these two possibilities separately below and show that in either case, $\beta_l$ and $\beta_{l-1}$
are unordered relative to each other in $H_O$.

If the invocation $\beta_{l-1}$ occurs in $D^i_O$, it occurs after $\beta_l$. If $\beta_{l-1}$ and $\beta_l$ were
ordered relative to each other in $H_O$, it follows from the definition of a schema that
they would be ordered the same way in $H^i_O$. However, $\beta_{l-1}$ occurs before $\beta_l$ in $D^k$,
which is consistent with $H_O$, and after $\beta_l$ in $D^i_O$, which is consistent with $H^i_O$. It
follows, then, that $\beta_{l-1}$ and $\beta_l$ must be unordered in $H_O$.

The other possibility is that $\beta_{l-1}$ does not occur in $D^i_O$. The definition of
C-schema implies that if an invocation $j \in I^i_O$, then all invocations k such that
$k \rightarrow_O j$ also belong to $I^i_O$. The invocation $\beta_l$ occurs in $D^i_O$. Hence, if $\beta_{l-1}$ is not
present in $D^i_O$, it follows that $\beta_{l-1}$ and $\beta_l$ are unordered relative to each other in
$H_O$, otherwise $H^i_O$ could not be based on a C-schema.

By the inductive hypothesis, $D^k$ is consistent with $H_O$. The order of all
invocations in $D^{k+1}$ is the same as in $D^k$, except for $\beta_{l-1}$ and $\beta_l$. But these invocations
are unordered in $H_O$. Hence, $D^{k+1}$ is consistent with $H_O$.

4.3.4.4. Proof that $\mathit{LegalResponses}_O(D^{k+1}) = \mathit{LegalResponses}_O(D^k)$

$D^{k+1}$ and $D^k$ differ in the order of $\beta_{l-1}$ and $\beta_l$, and as shown above these
invocations are unordered in $H_O$. There are two possible cases:
$time(\beta_{l-1}) < time(\beta_l)$ or $time(\beta_{l-1}) > time(\beta_l)$. The proof in both cases is similar
and will be developed in parallel, with the second case in square brackets.

Assume $time(\beta_{l-1}) < time(\beta_l)$ [resp. $time(\beta_l) < time(\beta_{l-1})$], that is,
$\beta_{l-1} \in before(\beta_l)$ [$\beta_l \in before(\beta_{l-1})$]. Now $H_O$ is an extension of $before(\beta_l)$
[$before(\beta_{l-1})$]. Since $H_O$ is a projection of a C-history and $\beta_{l-1}$ and $\beta_l$ are unordered
in it, it follows that $\beta_l$ does not conflict with $\beta_{l-1}$ [$\beta_{l-1}$ does not conflict with $\beta_l$].
$D^k$ is a development consistent with $H_O$ of the form $\gamma\, \beta_{l-1}\, \beta_l\, \delta$. The absence of a
conflict means that $\mathit{LegalResponses}_O(\gamma\, \beta_{l-1}\, \beta_l\, \delta) = \mathit{LegalResponses}_O(\gamma\, \beta_l\, \beta_{l-1}\, \delta)$,
which gives the result.


4.4.  Summary 


In  this  chapter,  we  introduced  the  concept  of  schemas  and  described  what  it 
means  for  a  schema  to  be  correct.  We  defined  C-schemas  and  proved  that  they  are 
correct with respect to C-histories. In the next chapter, we use a C-schema to
develop  an  efficient  implementation  of  distributed  objects. 


CHAPTER  5 


Implementation 


5.1.  Introduction 

In this chapter, we use a C-schema to develop two implementations for distributed
objects: a piggybacked implementation and a concurrent one. In both implementations,
the result of an invocation is returned as soon as it is executed at the
local  site,  without  requiring  the  user  to  wait  for  a  communication  with  remote 
sites.  The  latency  described  in  Section  1.2  is  thus  eliminated.  The  information 
sent  between  sites  by  the  piggybacked  implementation  is  added  to  messages 
already  being  used  by  the  scheduler  for  synchronization.  This  is  an  immediate 
advantage  in  systems  where  the  number  of  messages  in  the  system  is  a  dominant 
factor  in  system  performance,  because  the  piggybacked  implementation  requires  no 
additional messages.¹ The concurrent implementation is a modification of the
piggybacked one in which information about invocations is exchanged between sites in
parallel  with  the  execution  of  other  operations.  This  implementation  is  intended 
for  use  in  systems  where  the  number  of  messages  sent  is  not  a  major  constraint. 
Its advantage is that it leads to a more even distribution of work than does the
piggybacked implementation.

¹ In practice it is not possible to piggyback arbitrary amounts of information on
existing messages without causing them to be fragmented into a number of smaller
packets for transmission. Thus there could be some additional cost. We show later
how to keep the amount of information piggybacked small.

5.2. The piggybacked implementation

In the piggybacked implementation of an object O, a component is placed at
each of the sites in $Accessible_O$. Each component has a copy of $DATA_O$ and the
definitions of operations in $OP_O$, and can independently perform operations on its
copy  of  data.  When  an  invocation  is  scheduled  at  a  site,  the  component  at  that  site 
performs  the  requested  operation  using  its  copy  of  data,  and  returns  the  result 
immediately.  It  then  instructs  each  of  the  other  components  to  perform  the  same 
operation  on  their  copies.  Components  are  informed  of  invocations  scheduled  at 
other  components  in  such  a  way  that  the  resulting  implementation  corresponds  to  a 
C-schema.  The  correctness  of  a  C-schema  (Section  4.3.4)  implies  that  the  results 
returned  by  the  individual  components  are  consistent  with  the  specification  of  the 
object  as  a  whole. 

When an invocation is scheduled at a site, it is assigned a timestamp. Timestamps
have the property that of any two invocations scheduled at the same site,
the one scheduled earlier has a smaller timestamp. The timestamp of an invocation,
with the site name appended, is called the invocation-ID. Note that timestamps
and invocation-IDs can be generated locally at each site, and require no global
synchronization. A notification for a particular invocation is a record containing
the invocation-ID, the name of the object, the arguments for the invocation, and
the name of a destination site. Notifications are sent from the component where an
invocation  is  scheduled  to  each  of  the  other  components.  When  a  component 
receives  a  notification,  it  executes  the  named  operation  on  its  copy  of  data. 
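One possible in-memory representation of these records is sketched below. The class names, the use of a per-site counter as the timestamp source, and the explicit `operation` field are assumptions made for illustration; the text lists only the invocation-ID, object name, arguments, and destination.

```python
from dataclasses import dataclass
from itertools import count
from typing import Tuple

@dataclass(frozen=True, order=True)
class InvocationID:
    timestamp: int      # increases with scheduling order at one site
    site: str           # site name appended to the timestamp

@dataclass(frozen=True)
class Notification:
    inv_id: InvocationID
    obj: str            # name of the object
    operation: str      # assumed field: which operation to re-execute
    args: Tuple         # arguments for the invocation
    destination: str    # name of the destination site

class LocalClock:
    """Generates invocation-IDs locally, with no global synchronization."""

    def __init__(self, site: str):
        self.site = site
        self._counter = count(1)

    def next_id(self) -> InvocationID:
        return InvocationID(next(self._counter), self.site)

clock = LocalClock("s")
first, second = clock.next_id(), clock.next_id()
note = Notification(first, "O", "write", (7,), "t")
```

A counter is only one way to obtain the required property; any locally monotonic source would serve equally well in this sketch.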


In the piggybacked implementation, copies of notifications created at a site s
are ordered according to timestamp and are piggybacked on all subsequent
synchronization messages sent from s. When a synchronization message arrives at a
site t, the piggybacked notifications are first processed in order. Then the
synchronization message itself is acted upon. For each piggybacked notification, the
following occurs. If t is the destination site for the notification, the notification is
delivered to the component at t, which executes the named operation. Otherwise, a
copy of the notification is saved, and is piggybacked on all synchronization
messages that are subsequently sent from t. Thus, copies of a notification may travel
from site to site and may reach its destination by many different paths. Below, we
show how the invocation-IDs are used to ignore all but the first copy that arrives
at a site, and to purge copies of notifications from the system once a copy has
reached its destination. The method used is similar to the algorithm in [47].

5.2.1.  Transmission  of  notifications 

The algorithm followed at a site s to distribute notifications is shown in
Figure 5.1. Each site s maintains a buffer $Outgoing_s$ of outgoing notifications.
$Outgoing_s$ contains notifications originating at s as well as copies that arrive at s
en route to other destinations. A copy of a notification remains in $Outgoing_s$, and
continues to be piggybacked on all outgoing synchronization messages, until site s
learns that the destination has received a copy.

Observe that if a notification $n_1$ created at s carries a smaller timestamp than
another notification $n_2$ also created at s, then a copy of $n_1$ must be received at any
other site t before the first copy of $n_2$ is received there. This is because if $n_1$ does
not arrive first at t by another path, then a copy of $n_1$ will be piggybacked on the
same path of synchronization messages as $n_2$, and will be ordered before $n_2$.
Hence, if each site s keeps track of the largest timestamp it has observed on a
notification originating from each other site t, it can ignore all notifications with
smaller timestamps, as it must have already received a copy.

Whenever a synchronization message m is being sent from site s to site t:

• Piggyback on m a copy of all notifications in $Outgoing_s$, in order.

• Piggyback the values of $LargestSeen_s$ and $TheirLargest_s$.

When a synchronization message is received from site t:

• For each site v, accept all piggybacked notifications originating from v
whose timestamps are greater than $LargestSeen_s[v]$, and set the value
of $LargestSeen_s[v]$ to the largest such timestamp.

• Process, in order, all notifications with destination s, and append all
other notifications to $Outgoing_s$, preserving their order.

• Set the values of $TheirLargest_s[t][w]$ to the piggybacked values of
$LargestSeen_t[w]$.

• Set the value of $TheirLargest_s[v][w]$ to the larger of $TheirLargest_s[v][w]$
and the piggybacked value of $TheirLargest_t[v][w]$.

• Delete from $Outgoing_s$ all notifications from site w to site v with
timestamps smaller than or equal to $TheirLargest_s[v][w]$.

Figure 5.1. Piggybacked implementation, as followed by site s.


At each site s, the array element $LargestSeen_s[v]$ records the largest timestamp
that s has observed on a notification originating from v. Notifications with
smaller timestamps are ignored by s. The array element $TheirLargest_s[v][w]$
records at s the value of $LargestSeen_v[w]$ at the time of the last message that v
sent to s. Site s deletes from $Outgoing_s$ any notification originating at w with
destination v that carries a smaller timestamp than $TheirLargest_s[v][w]$, because v
must have already received a copy. For the case w = s, this means that s deletes
copies of any notification created at s that it knows to have been received at the
destination v.
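The bookkeeping of Figure 5.1 can be exercised with a small in-memory simulation. The sketch below is an illustration under stated assumptions, not the thesis's implementation: notifications are plain tuples, a synchronization message is reduced to its piggybacked payload, timestamps are positive integers from a per-site counter, and a site records its own timestamps in `largest_seen` so that returning copies of its own notifications are ignored.

```python
from collections import defaultdict

class Site:
    """One site following the steps of Figure 5.1 (illustrative sketch)."""

    def __init__(self, name):
        self.name = name
        self.clock = 0                           # local timestamp source
        self.outgoing = []                       # (origin, timestamp, destination, op)
        self.largest_seen = defaultdict(int)     # origin -> largest timestamp seen
        self.their_largest = defaultdict(lambda: defaultdict(int))  # v -> w -> timestamp
        self.executed = []                       # (origin, timestamp, op)

    def invoke(self, op, other_sites):
        """Execute locally, return at once, and queue one notification per site."""
        self.clock += 1
        self.largest_seen[self.name] = self.clock    # assumption: own work counts as seen
        self.executed.append((self.name, self.clock, op))
        for dest in other_sites:
            self.outgoing.append((self.name, self.clock, dest, op))

    def send(self):
        """Payload piggybacked on a synchronization message from this site."""
        return {
            "sender": self.name,
            "notes": list(self.outgoing),
            "largest_seen": dict(self.largest_seen),
            "their_largest": {v: dict(ws) for v, ws in self.their_largest.items()},
        }

    def receive(self, msg):
        t = msg["sender"]
        seen_before = dict(self.largest_seen)    # thresholds fixed for this message
        for origin, ts, dest, op in msg["notes"]:
            if ts <= seen_before.get(origin, 0):
                continue                         # a copy was already received
            self.largest_seen[origin] = max(self.largest_seen[origin], ts)
            if dest == self.name:
                self.executed.append((origin, ts, op))
            else:
                self.outgoing.append((origin, ts, dest, op))
        for w, ts in msg["largest_seen"].items():
            self.their_largest[t][w] = ts
        for v, ws in msg["their_largest"].items():
            for w, ts in ws.items():
                self.their_largest[v][w] = max(self.their_largest[v][w], ts)
        # Purge copies whose destination is known to have received them.
        self.outgoing = [n for n in self.outgoing
                         if n[1] > self.their_largest[n[2]][n[0]]]

s, t, u = Site("s"), Site("t"), Site("u")
s.invoke("a", ["t", "u"])
first = s.send()
t.receive(first)
t.receive(first)          # duplicate delivery: both copies are ignored
u.receive(t.send())       # u gets its copy relayed through t
s.receive(u.send())       # s learns that both t and u have seen timestamp 1
t.receive(s.send())       # t learns that u has its copy and purges the relay copy
```

In this run the notification for `u` reaches its destination by way of `t`, and both `s` and `t` empty their buffers once acknowledgement information flows back on later messages.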

5.2.2.  Correctness 

We now show that the piggybacked implementation corresponds to a C-schema.
A component executes operations in the order in which it creates
notifications or first receives them from other components. We must show that
whenever an invocation i is executed at a component, the invocations executed at
that component form a development that is consistent with the i-history specified
by a C-schema.² In other words, we must show that if we have two invocations $i_1$
and $i_2$ on the same object such that $i_1 \rightarrow_{sys} i_2$, then $i_1$ is executed before $i_2$ at all
the components. In Section 2.6, we observed that if $i_1 \rightarrow_{sys} i_2$, then there is a
message path from $i_1$ to $i_2$; that is, there is a sequence of messages from s, where $i_1$
is scheduled, to t, where $i_2$ is scheduled. The first message in this sequence is sent
from s after $i_1$ is performed there, and the last one arrives at t before $i_2$ is
performed. Each intermediate message is sent from the destination of the previous
one in the sequence, and the sending occurs after the receipt of the previous
message. Because such a message path exists, a copy of the notification for $i_1$ will be
piggybacked along this path and will arrive at t before $i_2$ occurs there. (For the
case s = t, it is convenient to think of the scheduler as ordering consecutive
invocations at the same site by sending a message from that site to itself, though this
need not actually be implemented.) Hence, $i_1$ will be performed before $i_2$ at site t.

² Recall that a development is simply a sequence representing the order in
which invocations are processed at a component.

We have seen that a copy of the notification for $i_1$ arrives at t before $i_2$ occurs
there. At the other sites, if the notification for $i_1$ does not arrive earlier by another
path, a copy will be piggybacked on the same sequence of messages that the
notification for $i_2$ is piggybacked upon, and will be ordered before the notification
for $i_2$. Hence, every other component will also execute $i_1$ before $i_2$.

This shows that the order in which invocations are performed at each
component is consistent with a C-schema. The piggybacked implementation is hence
correct  when  a  C-scheduler  is  used. 

5.2.3.  Optimizations 

A  number  of  optimizations  are  possible.  An  obvious  drawback  is  that 
notifications  are  piggybacked  on  synchronization  messages  that  might  never  lead  to 
their  intended  destination.  This  could  be  avoided  if  the  scheduler  indicates  which 
object  a  particular  synchronization  message  refers  to.  For  example,  if  a  lock-based 
scheduling  method  is  being  used,  the  objects  corresponding  to  a  particular  lock 
acquisition  or  release  message  are  known.  Notifications  could  then  be  piggybacked 
only  on  those  synchronization  messages  that  refer  to  the  objects  in  question. 

Another optimization, which is simple to implement, is to not piggyback on a
message to a site t those notifications in $Outgoing_s$ that have already been
piggybacked on an earlier message to t, even if their timestamps are larger than
$TheirLargest_s[v][w]$.

The sizes of the buffers $Outgoing_s$ as well as the number of notifications
piggybacked can also be controlled by periodically broadcasting $LargestSeen_s$ to other
sites. This enables them to promptly update their values of TheirLargest and
discard copies of notifications that have already reached their destinations. If this is
carried out frequently enough, the only notifications that should be in the buffers
are those in transit; that is, those for which the destination has not received a copy.
Broadcasting the values of $LargestSeen_s$, however, adds to the message traffic in the
system.

5.2.4.  Discussion 

Let us consider the features of the piggybacked implementation. First, a
component can return the results of an invocation when it is carried out locally,
without having to wait for the other components to be notified of it. This
eliminates latency. Second, no additional messages are required for communication
between components: all the necessary information is piggybacked on messages
already used by the scheduler for synchronization. Third, the implementation is
independent of the algorithm used by the scheduler, provided that it falls into the
class of C-schedulers. Even if the scheduler uses information like the semantics of
operations, the current state of objects, or the arguments for a particular
invocation, the messages it sends are sufficient for the implementation to be correct.

The piggybacked implementation shows that if data are to be accessed from
multiple sites in a distributed system, they can be replicated at these sites in a way
that requires no extra cost in terms of latency or number of messages. It must be
pointed out that a scheduler that permits access to an object from multiple sites
may have to use a large number of messages for synchronization. This, however, is
a cost resulting from permitting distributed access to data, and is independent of
whether data are replicated or not. The piggybacked implementation demonstrates
that the advantages of increased availability in the presence of failures and the
benefits of placing a copy of data at sites easily accessible by users need be
balanced only against the costs of storing multiple copies of data and of having to
process more than one copy. These costs are unavoidable if data are replicated.

5.3.  The  concurrent  implementation 

The piggybacked implementation has the following property. A synchronization
message arriving at a site to schedule an invocation i for execution may have
a large number of notifications piggybacked on it. One result is that synchronization
messages may become very large. Moreover, the execution of i will be delayed
until all the piggybacked notifications have been processed and all invocations
ahead of i have been performed. Note that this delay is different from the latency
described earlier, which was a wait for a message to be transmitted to a remote site
and for a reply to be received. The delay described here is a wait for local
processing to take place, which usually takes less time than that required for message
transmission. Nevertheless, this bursty pattern of execution, where a large
number of operations may have to be executed when a synchronization message
arrives and relatively little is done at other times, could lead to inefficient use of
computational resources. In systems where this is a performance issue, and where
the number of messages in the system is not a major constraint, the concurrent
implementation gives better performance.

In this variation, an invocation descriptor is piggybacked on synchronization
messages. A descriptor for an invocation consists of the invocation-ID and a
destination site. Notifications are transmitted directly to the destinations using an
atomic broadcast, which has the following properties.

1. The data broadcast are either received at all operational destinations, or at
none at all, even if site failures occur during the broadcast. Moreover, if an
atomic broadcast $B_2$ is initiated from a site after another atomic broadcast $B_1$
from the same site, and if the data broadcast by $B_2$ are received at its
destinations, then the data broadcast by $B_1$ are also received at its destinations.

2. If two atomic broadcasts made from the same site have destinations in
common, the data are received at overlapping destinations in the order that the
broadcasts were initiated.

3. If the data from an atomic broadcast $B_1$ are received at a site before an atomic
broadcast $B_2$ is initiated from that site, then the data from $B_1$ are received
before the data from $B_2$ at any overlapping destination.

A number of protocols have been proposed for implementing broadcasts with
these and similar properties [8, 14, 15, 41]. In [8], we describe a communication
sub-system that provides an atomic broadcast as a primitive operation. We denote
the initiation of the atomic broadcast for invocation i as AtBcast(i).

The concurrent implementation may be described in terms of two rules: a
broadcast ordering rule, which governs the order in which notifications are
transmitted, and a blocking rule, which specifies when the execution of certain
operations must wait for others. As in the piggybacked implementation, the result
of an invocation is returned once it has been completed at the local site. The
atomic broadcast of its notifications may be initiated after an arbitrary amount of
time, but must follow the broadcast ordering rule: if an invocation $i_1$ is scheduled
before another invocation $i_2$ at the same site, then AtBcast($i_1$) occurs before
AtBcast($i_2$). This condition can be enforced locally. It is possible, as an
optimization, to package more than one notification into the same atomic broadcast,
provided their order is observed at the destinations.
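Enforcing the broadcast ordering rule locally needs little more than a first-in, first-out queue of pending notifications. The following sketch assumes a hypothetical `at_bcast` callable standing in for the atomic broadcast primitive; the class name and the batching parameter are likewise illustrative.

```python
from collections import deque

class BroadcastOrdering:
    """Initiates atomic broadcasts in the order invocations were scheduled."""

    def __init__(self, at_bcast):
        self._pending = deque()      # notifications awaiting broadcast, oldest first
        self._at_bcast = at_bcast    # hypothetical atomic broadcast primitive

    def schedule(self, notification):
        """Record a locally scheduled invocation; its broadcast may come later."""
        self._pending.append(notification)

    def flush(self, max_batch=1):
        """Initiate one atomic broadcast carrying up to max_batch notifications.

        Notifications leave the queue oldest first, so a later invocation is
        never broadcast before an earlier one from the same site.
        """
        batch = []
        while self._pending and len(batch) < max_batch:
            batch.append(self._pending.popleft())
        if batch:
            self._at_bcast(batch)
        return batch

sent = []                            # stands in for the network in this sketch
queue = BroadcastOrdering(sent.append)
for name in ("i1", "i2", "i3"):
    queue.schedule(name)
queue.flush()                        # broadcasts i1 alone
queue.flush(max_batch=2)             # packages i2 and i3 into one broadcast
```

Batching preserves the rule because the order inside a batch matches scheduling order, which is the condition the text places on packaged notifications.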

At a destination, operations are performed in the order that notifications are
received. Note that the concurrent implementation permits AtBcast(i) to be
initiated any time after i is scheduled, provided the broadcast ordering rule is not
violated. Thus, notifications can be transmitted in such a way that they arrive at
their destinations spaced out over time, instead of bunched up behind
synchronization messages, as in the piggybacked implementation. This distributes the load
arising from executing operations, leading to fewer potential bottlenecks in the
utilization of system resources. The trade-off is the increased number of messages
in the system.

The blocking rule remains to be described. The piggybacking of invocation
descriptors on synchronization messages ensures that if i1 and i2 are two invocations
on the same object scheduled at sites s and t respectively, and if i1 →sys i2,
then a copy of the piggybacked descriptor for i1 is received at t before i2 is performed
there (Section 5.2.2). However, because the transmission of notifications
may be delayed arbitrarily, the notification for i1 may not yet have arrived at t,
and the corresponding operation may not have been executed. The blocking rule is
enforced if an invocation descriptor for, say, i1 has been received at a site but the
corresponding notification has not arrived. The rule states that in this situation, if any
invocation i2 is scheduled at that site after the receipt of the descriptor for i1, then the
execution of i2 is blocked until the notification for i1 has been received and the
corresponding operation performed. Note that this kind of blocking occurs only if
the transmission of a notification is unduly delayed. If notifications are broadcast
promptly after invocations are scheduled, this situation should be infrequent.
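
The blocking rule can be illustrated with a small sketch of one component's bookkeeping. It is a simplification under stated assumptions: the class Component and its method names are hypothetical, a remote operation is performed as soon as its notification arrives, and the broadcast of the local invocation's own notification is not modelled.

```python
class Component:
    """One site's copy of an object, tracking descriptors and notifications."""

    def __init__(self):
        self.awaiting = set()   # descriptors received whose notification is missing
        self.performed = []     # operations executed here, in execution order
        self.blocked = []       # (invocation, descriptors it still waits for)

    def receive_descriptor(self, inv):
        # A piggybacked descriptor arrived on a synchronization message.
        if inv not in self.performed:
            self.awaiting.add(inv)

    def schedule_local(self, inv):
        # Blocking rule: wait for every descriptor received before scheduling.
        waiting_on = set(self.awaiting)
        if waiting_on or self.blocked:
            self.blocked.append((inv, waiting_on))
        else:
            self.performed.append(inv)

    def receive_notification(self, inv):
        # The notification carries the operation; perform it on arrival.
        self.performed.append(inv)
        self.awaiting.discard(inv)
        for _, waiting_on in self.blocked:
            waiting_on.discard(inv)
        # Release blocked invocations from the head, preserving local order.
        while self.blocked and not self.blocked[0][1]:
            self.performed.append(self.blocked.pop(0)[0])


t = Component()
t.receive_descriptor("i1")      # descriptor for i1 arrives first
t.schedule_local("i2")          # i2 is blocked: notification for i1 is missing
before = list(t.performed)
t.receive_notification("i1")    # i1 is performed, then i2 is released
```

In this run nothing is performed while the notification for i1 is outstanding, and i1 is executed before i2 once it arrives.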

5.3.1.  Correctness 

To show that the concurrent implementation corresponds to a C-schema, we
must show that if we have two invocations i1 and i2 on the same object and if
i1 →sys i2, then i1 is performed before i2 at all components.

Let i1 and i2 be scheduled at sites s and t respectively. There is a path of synchronization
messages from s to t such that a piggybacked descriptor for i1 will
arrive at t before i2 is performed there. If the notification for i1 has not already
arrived at t, the blocking rule ensures that the execution of i2 will be delayed until
the notification for i1 arrives and is performed at t. Hence, i1 occurs before i2 at
the component at t.

To show that i1 is performed before i2 by the other components as well, we
consider two cases. If s = t, then the broadcast ordering constraint ensures that
AtBcast(i1) is initiated before AtBcast(i2). The properties of an atomic broadcast
ensure that the notification for i1 arrives before that for i2 at all destinations, so i1
occurs before i2 at all components. If s ≠ t, we know from the argument above
that the notification for i1 arrives at t before AtBcast(i2) is initiated there. Again,
the properties of atomic broadcasts ensure that the notification for i1 reaches all
destinations before that for i2. Hence, i1 is performed before i2 everywhere.

It follows that the concurrent implementation is consistent with a C-schema,
and is hence correct when a C-scheduler is used.

5.3.2.  Discussion 

The  concurrent  implementation,  like  the  piggybacked  implementation,  elim¬ 
inates  latency.  It  spreads  out  the  execution  of  remote  operations  over  time,  thus 
leading to better utilization of system resources. The decision as to when to perform
an atomic broadcast to distribute notifications is left unspecified. This opens
the  possibility  of  a  message  scheduler  being  used  to  make  this  decision  based  on  the 
current  load  on  the  network,  the  state  of  the  message  transmission  buffers,  and 
other  factors.  Such  a  message  scheduler  could  have  a  significant  influence  on  the 
performance  of  the  concurrent  implementation. 

The  concurrent  implementation  of  distributed  objects  makes  a  high  level  of 
concurrency  possible  in  replicated  distributed  systems.  If  implemented  at  a 
sufficiently  low  level  in  the  system,  this  concurrency  may  be  obtained  in  a  way  that 
is  transparent  to  high  level  application  programs.  Thus,  applications  could  perform 
very  efficiently  without  requiring  complex  programming  to  achieve  this  level  of 
performance. This idea is being explored in the ISIS system, currently being
developed  at  Cornell  [6,  7,  9,  10]. 


CHAPTER  6 


Failures 


6.1.  Introduction 


One  of  the  reasons  for  replicating  data  is  to  keep  objects  accessible  when  sites 
fail.  In  this  chapter,  we  discuss  how  the  implementations  of  distributed  objects 
given  in  Chapter  4  may  be  integrated  with  a  failure  handling  mechanism.  We  first 
consider  a  roll-back  mechanism,  where  a  failure  causes  transactions  to  be  undone 
and  re-executed  using  another  copy  of  the  data.  We  then  discuss  a  roll-forward 
mechanism,  where  transactions  in  progress  at  a  failed  site  are  completed  by 
another  site  that  takes  over  from  it. 

The failures we consider are fail-stop site failures [39]: a site fails by
halting all execution and sends no more messages; its failure is detectable by every
other  site  in  the  system.  Any  information  about  the  current  state  of  data  stored  at 
a  site  is  lost  when  it  fails;  when  a  site  recovers,  it  copies  up-to-date  information 
from  operational  sites.  We  do  not  consider  failures  where  a  site  takes  incorrect 
actions  while  remaining  operational  or  where  a  site  sends  out  spurious  messages. 
This  precludes  network  partitioning,  where  a  set  of  sites  remain  operational,  but 
are  unable  to  communicate  with  the  other  sites.  The  communication  medium  is 
assumed  to  be  error-free. 

The abstraction of fail-stop sites can be implemented to a high degree of accuracy
in software [40]. A software layer, called the failure detector, monitors the
sites and the communication medium. It implements fail-stop sites by shutting
down  any  site  suspected  of  malfunctioning,  cutting  off  communication  to  and  from 
the  site,  and  announcing  to  the  rest  of  the  system  that  the  site  has  failed.  In  the 
event  of  network  partitioning,  the  failure  detector  may  block  activities  until  the 
partition  is  resolved. 

6.2.  Roll-back  recovery 

Many  transaction-based  systems  handle  failures  by  undoing  the  effects  of 
ongoing  transactions  that  have  accessed  failed  sites.  These  transactions  are  re- 
executed  when  the  failed  site  recovers,  or  if  the  data  are  replicated,  they  are  re- 
executed  using  another  copy  of  the  data.  In  such  systems,  the  changes  made  by  a 
transaction  to  the  data  of  an  object  are  not  made  permanent  until  the  transaction 
terminates,  at  which  time  it  may  commit  or  abort.  A  commit  represents  normal 
termination,  while  an  abort  results  in  each  object  being  restored  to  a  state  in  which 
it  would  have  been  had  the  transaction  not  executed.  Thus,  when  a  site  fails,  the 
failure  handler  simply  aborts  all  ongoing  transactions  that  have  accessed  the  failed 
site.  These  transactions  are  later  re-executed. 

Hadzilacos  has  studied  the  scheduling  problem  in  an  environment  where  tran¬ 
sactions  may  be  aborted  as  a  result  of  system  failures,  and  discusses  restrictions  on 
the  class  of  serializable  histories  that  are  necessary  or  useful  in  this  setting  [23]. 
He  observes  that  when  failures  can  occur,  the  notion  of  serializability  should  be 
defined  in  terms  of  committed  transactions;  that  is,  it  must  account  for  the  fact  that 
a  system  failure  may  occur  at  any  time,  leading  to  an  abort  of  one  or  more  of  the 
uncommitted transactions. The essential condition is that the system history must
be commit serializable: the history formed by including only the committed transactions
should be serializable, and this must be true of any prefix of the system history
as well. It is also shown that any history belonging to the class CPSR
(corresponding  to  C-histories  in  our  model)  is  commit  serializable. 

Hadzilacos  also  discusses  other  properties  desirable  of  a  scheduler  operating  in 
a  system  where  transactions  may  be  aborted.  A  system  history  is  said  to  be  recov¬ 
erable  if  the  abortion  of  any  ongoing  transaction  does  not  invalidate  the  results  of  a 
committed  transaction.  If  a  scheduler  generates  system  histories  that  are  not 
recoverable,  an  abort  could  require  that  a  committed  transaction  be  undone,  which 
is  undesirable  and  may  even  be  impossible.  Finally,  he  strengthens  the  notion  of 
recoverability  to  strictness.  Strict  histories  have  two  advantages.  First,  they  avoid 
the  problem  of  cascading  aborts  —  the  situation  where  the  abort  of  one  transaction 
requires  that  other  active  transactions  be  aborted  as  well,  because  their  actions 
depended on some of the actions of the aborted transaction. While cascading aborts
do not compromise correctness, maintaining information about the dependencies
between transactions is a complex and inefficient task [20, 35]. The other advantage
of strict histories is that a transaction can be aborted simply by restoring each
object  to  the  state  that  it  was  in  at  the  time  the  transaction  started  executing, 
regardless  of  the  actions  of  any  concurrently  executing  transactions. 
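
Strictness is not defined formally in this section. The sketch below assumes the standard read/write formulation (no transaction reads or overwrites an item whose last writer is another transaction that has not yet committed or aborted), which may differ in detail from the operation-based model used in this thesis. The function name is_strict and the tuple encoding of a history are illustrative choices.

```python
def is_strict(history):
    """Check a history given as (tid, action, item) tuples.

    action is "r" (read), "w" (write), "c" (commit) or "a" (abort);
    item is None for commit and abort steps.
    """
    dirty = {}  # item -> transaction that wrote it and has not yet terminated
    for tid, action, item in history:
        if action in ("c", "a"):
            # Termination releases every item this transaction had written.
            dirty = {x: w for x, w in dirty.items() if w != tid}
            continue
        writer = dirty.get(item)
        if writer is not None and writer != tid:
            return False  # touches data written by an unterminated transaction
        if action == "w":
            dirty[item] = tid
    return True


# T2 reads x before T1 terminates: not strict (an abort of T1 would cascade).
cascading = [("T1", "w", "x"), ("T2", "r", "x"), ("T1", "c", None), ("T2", "c", None)]
# T2 waits until T1 has committed: strict.
waited = [("T1", "w", "x"), ("T1", "c", None), ("T2", "r", "x"), ("T2", "c", None)]
results = (is_strict(cascading), is_strict(waited))
```

Under this assumed definition the first history is rejected and the second accepted, which matches the informal description of cascading aborts given above.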

The  results  of  Hadzilacos  are  a  strong  argument  in  favor  of  restricting  the 
scheduler to produce only strict histories, if transactions can be aborted. Hadzilacos
shows how to modify commonly used scheduling disciplines - two-phase locking,
timestamp ordering, and serialization graph testing - so that they may be
used in an environment where aborts may occur. The changes are minor, and
ensure  that  only  strict  histories  are  generated.  Since  the  result  we  proved  in 
Chapter  4  holds  for  any  C-scheduler,  it  remains  valid  when  such  changes  are  made. 

We  now  outline  how  aborts  can  be  included  in  the  model  of  Chapter  2.  Each 
object is considered to have two special operations: begin and abort. Invoking
abort(Tid) restores the state of the object to the state that existed when begin(Tid)
was invoked. Hence, if the scheduler generates only strict histories, a transaction
may  be  aborted  by  invoking  an  abort  operation  for  every  object  it  accesses.  To 
allow  for  this,  the  set  of  all  possible  transactions  TRANS  is  extended  as  follows. 
Each  transaction  T  in  TRANS  is  preceded  by  an  invocation  of  begin  for  each 
object  that  the  transaction  accesses.  Further,  for  each  transaction  T  in  TRANS 
and  each  invocation  i  in  T,  a  new  transaction  is  added  to  TRANS,  which  invokes 
an  abort  operation  after  i  for  every  object  that  T  accesses,  and  has  no  subsequent 
invocation.  This  reflects  the  fact  that  a  transaction  may  be  aborted  at  any  time 
during its execution because of a failure. To the scheduler, which we assume generates
only strict histories, begin and abort invocations appear as invocations of
normal operations.
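
The begin and abort operations can be sketched as a snapshot and a restore. This is a minimal illustration, assuming that the object's state can be deep-copied and that the scheduler produces only strict histories; the class name RecoverableObject and the use of a Python list as the state are hypothetical.

```python
import copy


class RecoverableObject:
    """An object offering the two special operations begin and abort."""

    def __init__(self, state):
        self.state = state
        self._saved = {}  # Tid -> state that existed when begin(Tid) was invoked

    def begin(self, tid):
        # Remember the state as it is when the transaction starts.
        self._saved[tid] = copy.deepcopy(self.state)

    def abort(self, tid):
        # Restore the state that existed when begin(tid) was invoked.
        self.state = self._saved.pop(tid)


stack = RecoverableObject([1, 2])
stack.begin("T1")
stack.state.append(3)   # an operation performed on behalf of T1
stack.abort("T1")       # the push is undone
```

Restoring a whole snapshot is only safe because strictness guarantees that no other active transaction has modified the object in the meantime.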

A  roll-back  recovery  mechanism  such  as  the  one  described  above  can  be  used 
in  conjunction  with  a  piggybacked  or  a  concurrent  implementation  of  objects  in 
much  the  same  way  that  it  can  be  used  with  a  synchronous  implementation.  It 
may  appear  that  the  implementation  could  be  incorrect  in  the  presence  of  failures 
because  a  piggybacked  notification  or  descriptor  could  arrive  at  a  site  en  route  to 
another, and be lost at the intermediate site if the site fails. However, if an invocation
is scheduled after a failure, there must be some path of synchronization messages
that leads to it from every invocation ordered before it; otherwise the execution
could be incorrect. The required notification or descriptor will reach its destination
along this path, even if copies along other paths are lost.

A  minor  modification  may  be  required,  depending  on  how  transactions  are 
aborted.  If  abort  operations  are  invoked  independently  from  the  normal  schedul¬ 
ing  mechanism,  then  it  is  possible  for  a  notification  from  an  aborted  transaction  to 
arrive at a component after the transaction has been aborted there. If transaction
IDs are included in notifications, such notifications can be detected and
ignored.  This  problem  does  not  arise  if  transaction  aborts  are  integrated  with  the 
normal  scheduling  mechanism. 
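
The detection of late notifications can be sketched as a filter keyed on the transaction ID. The names NotificationFilter, abort and deliver are hypothetical, and the sketch assumes that each notification carries the ID of the transaction that issued it.

```python
class NotificationFilter:
    """Drops notifications that belong to transactions already aborted here."""

    def __init__(self):
        self.aborted = set()  # IDs of transactions aborted at this component
        self.applied = []     # (tid, operation) pairs actually performed

    def abort(self, tid):
        self.aborted.add(tid)

    def deliver(self, tid, operation):
        if tid in self.aborted:
            return False      # late notification: detected and ignored
        self.applied.append((tid, operation))
        return True


component = NotificationFilter()
component.abort("T1")
late = component.deliver("T1", "write x")   # arrives after the abort
live = component.deliver("T2", "write y")   # an unrelated, active transaction
```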

6.3.  Roll-forward  recovery 

Aborting and re-executing transactions is just one way of handling failures.
An  alternative,  especially  when  data  are  replicated,  is  to  provide  a  roll-forward 
mechanism.  By  this  we  mean  that  transactions  in  progress  at  a  site  that  fails  are 
continued  and  completed  by  another  site.  Recovery  schemes  along  these  lines  are 
presented in [10, 39]. The failure of a site is thus masked from a user (except
perhaps as an increase in response time when a failure occurs), unless all the sites
where  a  piece  of  data  is  replicated  happen  to  fail.  This,  however,  should  occur  rela¬ 
tively  rarely. 

When  a  roll-forward  scheme  is  used  in  conjunction  with  a  piggybacked  or  con¬ 
current  implementation,  a  number  of  problems  arise.  These  are  mainly  due  to  the 
fact that the failure or recovery of a site can be detected at different times and in
differing orders by other sites in the system. Despite this, sites must react consistently
to the failure and provide correct responses for ongoing transactions. We
illustrate  the  problems  by  means  of  a  few  examples. 

Assume  that  one  of  the  invocations  of  an  ongoing  transaction  is  scheduled  at  a 
site  that  subsequently  fails.  At  another  site,  a  component  of  the  invoked  object 
may  observe  any  of  the  following  outcomes. 

1.  The  notification  for  the  invocation  arrives,  and  the  failure  is  detected  later. 

2.  The  failure  is  first  detected,  and  the  notification  arrives  later. 

3.  The failure is detected, but the notification never arrives (because the site
failed before sending the notification).

If a component detects the failure and wishes to continue execution of the
transaction, it cannot tell whether a notification from the failed site will arrive or
not (i.e., it cannot distinguish between cases 2 and 3). Also, if some of the components
receive the notification before detecting the failure (case 1) and others
receive it after detecting the failure (case 2), the actions they take must not lead to
inconsistencies in copies of the object's data.

Another kind of problem is exemplified in the following scenario. Assume that
it is agreed that the operational component with the lowest numbered ID takes
over, completes and responds to invocations interrupted by a site failure. Let an
invocation be scheduled at site s, which subsequently fails. The lowest numbered
component, say at site t, completes the invocation. Assume that t also fails. If the
component that now has the lowest ID detects the failure of site t before that of site
s, it cannot know that t took over from s and completed the invocation. Hence, it
would incorrectly re-execute the invocation and attempt to provide a second
response.
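
This scenario can be replayed in a few lines. The sketch is an illustration of the hazard only, not a recovery protocol: the class Survivor, the numeric IDs, and the interrupted argument (the invocations a component believes were in progress at the failed site) are all assumptions made for the example.

```python
class Survivor:
    """A component applying the 'lowest operational ID takes over' rule."""

    def __init__(self, my_id, all_ids):
        self.my_id = my_id
        self.up = set(all_ids)  # sites this component believes are operational
        self.responses = []     # invocations this component completed and answered

    def detect_failure(self, failed_id, interrupted=()):
        self.up.discard(failed_id)
        if self.my_id == min(self.up):
            # Lowest-numbered survivor: complete the interrupted invocations.
            self.responses.extend(interrupted)


S, T, U = 3, 1, 2   # hypothetical IDs: t has the lowest ID, u the next lowest
t = Survivor(T, [S, T, U])
u = Survivor(U, [S, T, U])

t.detect_failure(S, interrupted=["inv"])  # t takes over from s and responds
# Site t now fails as well; u observes t's failure before s's failure.
u.detect_failure(T)                       # u sees nothing pending at t
u.detect_failure(S, interrupted=["inv"])  # u still believes "inv" is pending
```

Both t and u end up responding to the same invocation, which is exactly the duplicate response described above.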

The  order  in  which  recoveries,  not  just  failures,  are  detected  is  also  an  issue. 
Recall  that  if  a  site  fails,  it  loses  all  information  about  the  current  state  of  data. 
Hence,  the  usual  action  taken  by  a  component  when  it  recovers  is  to  obtain  an  up- 
to-date  copy  of  data  from  an  operational  component.  For  the  recovered  component 
to  continue  to  behave  correctly,  it  must  receive  notifications  of  all  subsequent  invo¬ 
cations  scheduled  at  other  components  of  the  same  object.  However,  if  the  other 
components  detect  the  recovery  at  different  times,  they  may  neglect  to  send  the 
recovered  component  notifications  it  should  receive. 

The problems described above arise when sites observe failures and recoveries
at different times and in different orders. These situations are handled easily when
a synchronous implementation is used for objects, because components coordinate
their actions for every invocation. In a non-synchronous implementation, however,
components are permitted to get out of step with each other, and the timing and
order in which failures and recoveries are observed become an issue.

One  solution  is  for  the  components  of  an  object  to  run  an  agreement  protocol 
each  time  a  component  detects  a  failure  or  recovery.  They  can  then  agree  on  the 
status  of  outstanding  notifications  and  take  a  consistent  set  of  actions  to  handle  the 
failure  or  recovery.  Another  solution  is  to  integrate  the  failure  detector  with  the 
communication sub-system used by the scheduler. In [8], we describe such an
approach.  Our  communication  sub-system  provides  a  set  of  broadcast  primitives 
that order message delivery with respect to failure detection and recoveries. For
example, it is not possible for data sent in the same broadcast to a set of sites to be
received  at  some  of  the  destinations  before  a  failure  is  detected  there,  while  arriv¬ 
ing  at  other  destinations  after  the  same  failure  is  detected  at  those  sites.  More¬ 
over,  each  site  observes  failures  and  recoveries  in  a  consistent  order.  The  effect  is 
to  serialize  failure  and  recovery  decisions  relative  to  other  events  in  the  system. 
With  such  a  communication  sub-system,  problems  such  as  those  described  earlier 
do  not  arise  when  a  piggybacked  or  concurrent  implementation  is  used.  An  addi¬ 
tional  advantage  of  using  this  communication  sub-system  is  that  it  makes  recovery 
protocols  very  simple  to  describe  and  program.  These  advantages  have  prompted 
us  to  re-implement  the  ISIS  communication  sub-system  along  these  lines. 

6.4.  Conclusion 

We  have  indicated  how  the  piggybacked  or  concurrent  implementations  of  dis¬ 
tributed  objects  may  be  integrated  with  a  roll-back  or  a  roll-forward  recovery 
mechanism.  With  the  former,  little  modification  is  required.  The  latter  requires  a 
protocol  that  ensures  that  all  sites  have  a  consistent  view  of  the  timing  and  order 
of  failures  and  recoveries.  The  existence  of  such  protocols  justifies  our  claim  that 
the  methods  presented  here  are  applicable  in  a  wide  range  of  fault-tolerant  sys¬ 
tems,  and  not  just  fault-intolerant  ones  that  happen  to  be  distributed. 


CHAPTER  7 


Performance 


7.1.  Introduction 

The  piggybacked  or  the  concurrent  implementation  of  a  distributed  object 
offers  an  advantage  over  a  synchronous  one  because  it  does  not  incur  a  latency 
cost.  To  obtain  a  quantitative  measure  of  this  advantage,  we  compared  the  perfor¬ 
mance  of  a  concurrent  implementation  with  that  of  a  synchronous  one.  Rather 
than  building  a  system  of  distributed  objects  from  scratch,  we  modified  an  already 
existing  prototype  of  the  ISIS  system  and  carried  out  our  measurements  in  this 
context.  The  figures  we  present  should  not  be  taken  as  an  absolute  measure  of  the 
performance  obtainable  from  a  concurrent  implementation,  because  they  include  a 
large  overhead  arising  from  the  ISIS  system  itself.  They  are,  instead,  intended  to 
compare  the  performance  of  the  concurrent  and  synchronous  implementations, 
other  things  being  equal.  We  are  pleased  to  report  that  the  concurrent  implemen¬ 
tation  performed  significantly  better.  When  stripped  of  the  overhead  of  the  ISIS 
system,  performance  gains  should  be  even  higher. 

7.2.  The  ISIS  system 

The ISIS system, under development at Cornell, aims to aid the construction
of fault-tolerant software. It automatically converts fault-intolerant specifications
of objects into fault-tolerant implementations. ISIS operates by transforming an
object specification into a resilient object: an implementation of a distributed object
that continues to accept and respond to invocations despite site failures. In a resilient
object, data and programs are replicated at more than one site and actions at
these sites are coordinated in a way that enables operational sites to continue to
operate correctly even if one or more sites fail. The details of replication, the protocols
used to maintain the consistency of replicated data, and the failure handling
mechanisms are all hidden from a user of ISIS. A user specifies a resilient object
in much the same way as he or she would program a fault-intolerant object. This
enables a relatively inexpert user to build a fault-tolerant application. Details of
the ISIS system can be found in [6, 7, 9, 10].

Figure 7.1. ISIS system architecture

7.3.  The  ISIS  prototype 

A  prototype  of  the  ISIS  system  has  been  operational  since  January  1985.  It 
runs  on  top  of  4.2  UNIX  on  a  cluster  of '5  SUN  2/50  work-stations,  interconnected 
by a 10 Mbit/s Ethernet. Figure 7.1 shows the hierarchical architecture of the ISIS
system.  The  lowest  level  uses  a  site-to-site  windowed  acknowledgement  protocol  to 
provide  sequenced,  (almost)  error-free  message  transmission.  It  detects  site  failures 
using  timeouts  and  runs  a  protocol  to  ensure  that  all  sites  observe  site  failures  and 
recoveries  in  a  consistent  order.  On  top  of  this  layer  are  system  utilities,  which  are 
used  to  implement  broadcast  primitives  with  different  ordering  properties.  Resi¬ 
lient  objects  reside  on  top  of  this  layer.  Some  of  the  system  services  like  a  name- 
space  and  a  capability  manager  are  built  as  resilient  objects. 

In  UNIX,  processes  and  inter-process  communication  are  relatively  expensive. 
Hence,  a  single  system  process  is  used  at  each  site  to  handle  functions  common  to 
all  resilient  objects.  Also,  only  one  process  is  used  at  each  site  to  implement  all 
objects of the same type, as in [34]. This process is called a type manager and multiplexes
its time among all the objects of that type. A new process is created only
when  a  new  object  type  is  installed  at  a  site.  Utilities  to  load  and  unload  type 


7.4.  Performance  measurements 


To  measure  the  performance  of  the  concurrent  implementation,  the  system 
was  configured  so  that  objects  could  function  either  in  a  synchronous  or  in  a  con¬ 
current  mode.  In  the  synchronous  mode,  a  message  is  transmitted  to  all  com¬ 
ponents  of  the  object  for  each  operation  (or  sub-operation),  and  the  next  operation  is 
executed  only  after  an  acknowledgement  is  received.  This  ensures  a  consistent 
order  of  execution  at  each  component.  In  the  concurrent  mode,  the  method  of  Sec¬ 
tion  5.3  (the  concurrent  implementation)  is  used  for  all  resilient  objects.  The 
implementation  in  ISIS  differs  from  the  method  as  presented  here  in  one  respect. 
Because  of  the  way  ISIS  is  constructed,  each  operation  is  broken  up  into  read  and 
write  sub-operations  before  being  executed.  This  results  in  more  than  one 
notification  being  sent  for  each  invocation.  The  additional  message  overhead 
makes  the  performance  of  the  concurrent  implementation  appear  poorer  than  it 
should be. Table 7.1 presents some general performance measurements for the system,
to put the later figures for object implementations into perspective.

In Table 7.2, we see the results of measurements made on four objects: a file
with  read  and  write  operations,  a  directory  with  bind  and  lookup  operations,  a 
stack  with  push  and  pop  operations,  and  an  indexed  file  with  add  and  lookup 
operations.  In  each  case,  we  measured  the  average  time  to  execute  25  invocations 
of the designated operation in both modes. The read operation for the file
object and the lookup operations for the directory and indexed file objects are
read-only operations. They are implemented in a way that does not require accessing
a remote copy; hence, the concurrent implementation does not affect performance.
The first time one of the other operations is executed, ISIS requires that a
write lock be acquired at remote components. The time taken to acquire the lock is
averaged over the remaining 24 invocations, which occur within the same transaction
and do not have to wait.

Table 7.2. Performance of the concurrent implementation

7.5.  Discussion 

The  figures  show  that  the  concurrent  implementation  results  in  a  significant 
improvement  in  performance  when  compared  with  a  synchronous  implementation. 
Performance gains varied from 200% to 1000% in terms of the number of invocations
executed per second. Our figures for the concurrent mode are better even in
the  single  site  case  because  the  concurrent  implementation  was  used  to  maintain 
message  routing  tables  in  the  type  managers.  An  important  observation  is  that  the 
performance  of  the  synchronous  implementation  degrades  rapidly  as  the  number  of 
sites  is  increased,  while  that  of  the  concurrent  implementation  decreases  only 
slightly.  This  means  that  the  perceived  performance  of  a  replicated  object  accessi¬ 
ble  from  a  number  of  sites  can  be  made  comparable  to  that  of  a  single-site  (non- 
replicated)  object.  Although  these  results  are  not  surprising  in  the  light  of  the  way 
in  which  the  concurrent  implementation  works,  it  is  indeed  encouraging  to  see  that 
the  constraints  of  a  real-life  system  do  not  invalidate  our  expectations. 

One interesting observation that resulted from our tests was that as the rate
of performing operations was increased, the buffering of notifications became a
bottleneck. This is in keeping with our remark in Section 5.3.2 that message scheduling
could have an important influence on the performance of the concurrent implementation.

These experiments by no means constitute an extensive test of the method.
Nevertheless,  they  suffice  to  show  that  considerable  performance  benefits  can  be 
obtained  from  the  concurrent  implementation  of  replicated  objects.  The  measure¬ 
ments  include  the  overhead  of  the  ISIS  system.  Furthermore,  because  the  ISIS 
system  is  still  only  a  prototype  and  is  built  on  top  of  UNIX,  it  is  not  tuned  for  high 
performance.  The  multi-process  structure  imposes  a  substantial  scheduling  and 
inter-process communication overhead. Also, the performance of the remote procedure
call connections is suboptimal, primarily because the SUN version of UNIX
does not support changes to the IPC buffer size. Even better results can be
expected in a system that is fine-tuned for high performance.


CHAPTER 8


Conclusion 


8.1.  What  have  we  achieved? 

The  main  result  of  this  work  has  been  to  demonstrate  that  data  may  be  repli¬ 
cated  in  a  distributed  system  without  incurring  the  cost  of  latency.  We  first 
presented  a  general  model  for  distributed  systems  and  extended  the  database  con¬ 
cepts  of  conflicting  operations  and  serializability.  We  then  used  this  model  to  show 
that  in  a  replicated  object,  operations  on  remote  copies  of  data  may  be  decoupled 
from  local  operations,  so  that  the  perceived  performance  is  comparable  to  that  of  a 
non-replicated  object.  In  addition,  we  showed  that  the  implementation  can  be  done 
in  a  way  that  requires  no  more  messages  than  a  single-site  object  accessed  from 
more  than  one  site.  Finally,  we  validated  our  claims  by  testing  out  the  method  on 
an  actual  system. 

Our work shows that if data are being accessed from multiple sites in a distributed
system, then they can be replicated at these sites, while attaining high levels of
concurrency. The implication is that in such a system, the only costs that must be
incurred  in  order  to  obtain  high  availability  and  fault-tolerance  by  replication  are 
the  costs  of  storing  more  than  one  copy  of  data  and  local  processing  costs. 


8.2.  Where  do  we  go  from  here? 

There  are  two  general  directions  in  which  this  work  may  be  continued.  The 
first  relates  to  the  choice  of  serializability  as  our  correctness  criterion.  The  other 
deals  with  the  restriction  to  asynchronous  systems. 

8.2.1.  Correctness  constraints 

It  has  been  observed  that  in  some  situations,  serializability  is  too  restrictive  a 
criterion  for  correctness  [21,  30].  We  have  adopted  serializability  as  our  correctness 
constraint  for  the  following  reasons.  To  begin  with,  our  notion  of  serializability  is 
more  general  than  in  database  literature  -  it  is  really  a  requirement  that  there  be 
some  form  of  atomicity  in  the  system.  Atomicity  is  a  natural  and  simple  constraint 
to  require  in  a  distributed  system.  A  closer  look  at  our  results,  however,  reveals 
that  our  methods  of  replicating  data  depend  only  on  the  fact  that  conflicting  opera¬ 
tions  are  ordered  relative  to  each  other.  They  do  not  require  that  this  ordering 
occurs  in  a  way  that  leads  to  serializability.  This  opens  the  possibility  that  the 
methods  can  be  used  without  change  in  a  system  where  a  weaker  form  of  correct¬ 
ness  is  being  enforced.  Characterizing  this  weaker  form  of  correctness  formally 
and  demonstrating  situations  where  such  a  correctness  constraint  may  be  useful 
remains an area for future work. The inverse problem - that of adapting the
methods given here to suit a system with a given weaker correctness constraint -
is another such area.

8.2.2. What if the system is synchronous?

In  an  asynchronous  system,  the  only  way  in  which  a  site  becomes  aware  of  the 
completion  of  an  event  at  another  site  is  by  receiving  a  message  from  the  latter 
site.  Our  implementations  make  use  of  these  messages  to  transfer  information 
needed  to  keep  the  copies  of  replicated  data  consistent.  In  a  system  where  a  syn¬ 
chronized  clock  is  available,  the  mere  fact  that  a  certain  period  of  time  has  elapsed 
could  be  used  to  infer  that  an  event  has  taken  place  at  a  remote  site.  For  example, 
if  all  sites  make  backups  of  their  respective  file  systems  at  5  o’clock  every  day,  and 
if  backup  operations  take  no  longer  than  half  an  hour,  then  at  5:30  every  site 
knows  that  every  other  site  must  have  completed  a  backup  for  the  day.  Thus  in  a 
synchronous  system,  information  can  be  gathered  without  explicit  message 
transfer.  In  [33],  Lamport  discusses  the  relation  between  using  synchronized  clocks 
and  using  messages  to  order  events. 
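
The backup example amounts to a purely local computation on a synchronized clock. The sketch below is illustrative only: it assumes that "5 o'clock" means 17:00, that all clocks agree exactly, and that the half-hour bound on backups is known to every site; the names BACKUP_START, MAX_BACKUP_DURATION and all_backups_done are hypothetical.

```python
from datetime import datetime, time, timedelta

BACKUP_START = time(17, 0)                    # assumed: backups start at 17:00
MAX_BACKUP_DURATION = timedelta(minutes=30)   # no backup takes longer than this


def all_backups_done(now: datetime) -> bool:
    """Infer, without any message exchange, that today's backups are complete."""
    start = datetime.combine(now.date(), BACKUP_START)
    return now >= start + MAX_BACKUP_DURATION


early = all_backups_done(datetime(2024, 1, 1, 17, 29))  # a backup may still run
done = all_backups_done(datetime(2024, 1, 1, 17, 30))   # every site has finished
```

The function consults only the local clock and the two shared constants, which is what distinguishes this kind of inference from one based on received messages.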

An assumption in the example above is that the behavior of every site is
"known" to every other site; otherwise they could not have concluded that the backups
were done. The concept of "knowledge" in a distributed system has been studied
in [13, 24]. It is shown that the information available to a site in a distributed
system is a consequence of "common knowledge" at the time the system is initialized
and of knowledge gained by the transfer of messages between sites. It would
be interesting to study whether the results presented here can be expressed on the
basis of "knowledge transfer" rather than message transfer, and thus made applicable
to synchronous systems.